Preface


I  have  long  been  fascinated  by  the  interplay  of  variables  in  multivariate  data  and  by 
the  challenge  of  unraveling  the  effect  of  each  variable.  My  continuing  objective  in 
the  second  edition  has  been  to  present  the  power  and  utility  of  multivariate  analysis 
in  a  highly  readable  format. 

Practitioners and researchers in all applied disciplines often measure several
variables on each subject or experimental unit. In some cases, it may be productive to
isolate each variable in a system and study it separately. Typically, however, the
variables are not only correlated with each other, but each variable is influenced by the
other variables as it affects a test statistic or descriptive statistic. Thus, in many
instances, the variables are intertwined in such a way that when analyzed individually
they yield little information about the system. Using multivariate analysis, the
variables can be examined simultaneously in order to access the key features of the
process that produced them. The multivariate approach enables us to (1) explore
the joint performance of the variables and (2) determine the effect of each variable
in the presence of the others.

Multivariate analysis provides both descriptive and inferential procedures: we
can search for patterns in the data or test hypotheses about patterns of a priori
interest. With multivariate descriptive techniques, we can peer beneath the tangled web of
variables on the surface and extract the essence of the system. Multivariate inferential
procedures include hypothesis tests that (1) process any number of variables without
inflating the Type I error rate and (2) allow for whatever intercorrelations the
variables possess. A wide variety of multivariate descriptive and inferential procedures
is readily accessible in statistical software packages.

My selection of topics for this volume reflects many years of consulting with
researchers in many fields of inquiry. A brief overview of multivariate analysis is
given in Chapter 1. Chapter 2 reviews the fundamentals of matrix algebra. Chapters
3 and 4 give an introduction to sampling from multivariate populations. Chapters 5,
6, 7, 10, and 11 extend univariate procedures with one dependent variable (including
t-tests, analysis of variance, tests on variances, multiple regression, and multiple
correlation) to analogous multivariate techniques involving several dependent variables.
A review of each univariate procedure is presented before covering the multivariate
counterpart. These reviews may provide key insights the student missed in previous
courses.

Chapters  8,  9,  12,  13,  14,  and  15  describe  multivariate  techniques  that  are  not 
extensions  of  univariate  procedures.  In  Chapters  8  and  9,  we  find  functions  of  the 
variables  that  discriminate  among  groups  in  the  data.  In  Chapters  12  and  13,  we 
find functions of the variables that reveal the basic dimensionality and characteristic
patterns  of  the  data,  and  we  discuss  procedures  for  finding  the  underlying  latent 
variables  of  a  system.  In  Chapters  14  and  15  (new  in  the  second  edition),  we  give 
methods  for  searching  for  groups  in  the  data,  and  we  provide  plotting  techniques  that 
show  relationships  in  a  reduced  dimensionality  for  various  kinds  of  data. 

In  Appendix  A,  tables  are  provided  for  many  multivariate  distributions  and  tests. 
These  enable  the  reader  to  conduct  an  exact  test  in  many  cases  for  which  software 
packages  provide  only  approximate  tests.  Appendix  B  gives  answers  and  hints  for 
most  of  the  problems  in  the  book. 

Appendix C describes an ftp site that contains (1) all data sets and (2) SAS
command files for all examples in the text. These command files can be adapted for use
in working problems or in analyzing data sets encountered in applications.

To illustrate multivariate applications, I have provided many examples and
exercises based on 59 real data sets from a wide variety of disciplines. A practitioner
or consultant in multivariate analysis gains insights and acumen from long
experience in working with data. It is not expected that a student can achieve this kind of
seasoning in a one-semester class. However, the examples provide a good start, and
further development is gained by working problems with the data sets. For example,
in Chapters 12 and 13, the exercises cover several typical patterns in the covariance
or correlation matrix. The student’s intuition is expanded by associating these
covariance patterns with the resulting configuration of the principal components or factors.

Although  this  is  a  methods  book,  I  have  included  a  few  derivations.  For  some 
readers,  an  occasional  proof  provides  insights  obtainable  in  no  other  way.  I  hope  that 
instructors  who  do  not  wish  to  use  proofs  will  not  be  deterred  by  their  presence.  The 
proofs  can  be  disregarded  easily  when  reading  the  book. 

My objective has been to make the book accessible to readers who have taken as
few as two statistical methods courses. The students in my classes in multivariate
analysis include majors in statistics and majors from other departments. With the
applied researcher in mind, I have provided careful intuitive explanations of the
concepts and have included many insights typically available only in journal articles or
in the minds of practitioners.

My overriding goal in preparation of this book has been clarity of exposition. I
hope that students and instructors alike will find this multivariate text more
comfortable than most. In the final stages of development of both the first and second
editions, I asked my students for written reports on their initial reactions as they read
each day's assignment. They made many comments that led to improvements in the
manuscript. I will be very grateful if readers will take the time to notify me of errors
or of other suggestions they might have for improvements.

I have tried to use standard mathematical and statistical notation as far as
possible and to maintain consistency of notation throughout the book. I have refrained
from the use of abbreviations and mnemonic devices. These save space when one
is reading a book page by page, but they are annoying to those using a book as a
reference.

Equations are numbered sequentially throughout a chapter; for example, (3.75)
indicates the 75th numbered equation in Chapter 3. Tables and figures are also
numbered sequentially throughout a chapter in the form “Table 3.8” or “Figure 3.1.”
Examples are not numbered sequentially; each example is identified by the same
number as the section in which it appears and is placed at the end of the section.

When citing references in the text, I have used the standard format involving the
year of publication. For a journal article, the year alone suffices, for example, Fisher
(1936). But for books, I have usually included a page number, as in Seber (1984,
p. 216).

This is the first volume of a two-volume set on multivariate analysis. The second
volume is entitled Multivariate Statistical Inference and Applications (Wiley, 1998).
The two volumes are not necessarily sequential; they can be read independently. I
adopted the two-volume format in order to (1) provide broader coverage than would
be possible in a single volume and (2) offer the reader a choice of approach.

The  second  volume  includes  proofs  of  many  techniques  covered  in  the  first  13 
chapters  of  the  present  volume  and  also  introduces  additional  topics.  The  present 
volume  includes  many  examples  and  problems  using  actual  data  sets,  and  there  are 
fewer  algebraic  problems.  The  second  volume  emphasizes  derivations  of  the  results 
and  contains  fewer  examples  and  problems  with  real  data.  The  present  volume  has 
fewer  references  to  the  literature  than  the  other  volume,  which  includes  a  careful 
review  of  the  latest  developments  and  a  more  comprehensive  bibliography.  In  this 
second  edition,  I  have  occasionally  referred  the  reader  to  Rencher  (1998)  to  note  that 
added  coverage  of  a  certain  subject  is  available  in  the  second  volume. 

I am indebted to many individuals in the preparation of the first edition. My
initial exposure to multivariate analysis came in courses taught by Rolf Bargmann at
the University of Georgia and D. R. Jensen at Virginia Tech. Additional impetus to
probe the subtleties of this field came from research conducted with Bruce Brown
at BYU. I wish to thank Bruce Brown, Deane Branstetter, Del Scott, Robert Smidt,
and Ingram Olkin for reading various versions of the manuscript and making
valuable suggestions. I am grateful to the following students at BYU who helped with
computations and typing: Mitchell Tolland, Tawnia Newton, Marianne Matis Mohr,
Gregg Littlefield, Suzanne Kimball, Wendy Nielsen, Tiffany Nordgren, David
Whiting, Karla Wasden, and Rachel Jones.

For the second edition, I have added Chapters 14 and 15, covering cluster analysis,
multidimensional scaling, correspondence analysis, and biplots. I also made
numerous corrections and revisions (almost every page) in the first 13 chapters, in an effort
to improve composition, readability, and clarity. Many of the first 13 chapters now
have additional problems.

I  have  listed  the  data  sets  and  SAS  files  on  the  Wiley  ftp  site  rather  than  on  a 
diskette,  as  in  the  first  edition.  I  have  made  improvements  in  labeling  of  these  files. 

I  am  grateful  to  the  many  readers  who  have  pointed  out  errors  or  made  suggestions 
for  improvements.  The  book  is  better  for  their  caring  and  their  efforts. 
I thank Lonette Stoddard and Candace B. McNaughton for typing and J. D.
Williams for computer support. As with my other books, I dedicate this volume to
my wife, LaRue, who has supplied much needed support and encouragement.

Alvin  C.  Rencher 


Acknowledgments 


I  thank  the  authors,  editors,  and  owners  of  copyrights  for  permission  to  reproduce 
the  following  materials: 

•  Figure  3.8  and  Table  3.2,  Kleiner  and  Hartigan  (1981),  Reprinted  by  permission 
of  Journal  of  the  American  Statistical  Association 

•  Table  3.3,  Kramer  and  Jensen  (1969a),  Reprinted  by  permission  of  Journal  of 
Quality  Technology 

•  Table  3.4,  Reaven  and  Miller  (1979),  Reprinted  by  permission  of  Diabetologia 

•  Table  3.5,  Timm  (1975),  Reprinted  by  permission  of  Elsevier  North-Holland 
Publishing  Company 

•  Table  3.6,  Elston  and  Grizzle  (1962),  Reprinted  by  permission  of  Biometrics 

•  Table  3.7,  Frets  (1921),  Reprinted  by  permission  of  Genetica 

•  Table  3.8,  O’Sullivan  and  Mahan  (1966),  Reprinted  by  permission  of  American 
Journal  of  Clinical  Nutrition 

•  Table  4.3,  Royston  (1983),  Reprinted  by  permission  of  Applied  Statistics 

•  Table  5.1,  Beall  (1945),  Reprinted  by  permission  of  Psychometrika 

•  Table  5.2,  Hummel  and  Sligo  (1971),  Reprinted  by  permission  of  Psychological 
Bulletin 

•  Table  5.3,  Kramer  and  Jensen  (1969b),  Reprinted  by  permission  of  Journal  of 
Quality  Technology 

•  Table  5.5,  Lubischew  (1962),  Reprinted  by  permission  of  Biometrics 

•  Table  5.6,  Travers  (1939),  Reprinted  by  permission  of  Psychometrika 

•  Table  5.7,  Andrews  and  Herzberg  (1985),  Reprinted  by  permission  of  Springer- 
Verlag 

•  Table  5.8,  Tintner  (1946),  Reprinted  by  permission  of  Journal  of  the  American 
Statistical  Association 

•  Table  5.9,  Kramer  (1972),  Reprinted  by  permission  of  the  author 

•  Table  5.10,  Cameron  and  Pauling  (1978),  Reprinted  by  permission  of  National 
Academy  of  Science 


•  Table  6.2,  Andrews  and  Herzberg  (1985),  Reprinted  by  permission  of  Springer- 
Verlag 

• Table 6.3, Rencher and Scott (1990), Reprinted by permission of
Communications in Statistics: Simulation and Computation

•  Table  6.6,  Posten  (1962),  Reprinted  by  permission  of  the  author 

•  Table  6.8,  Crowder  and  Hand  (1990,  pp.  21-29),  Reprinted  by  permission  of 
Routledge  Chapman  and  Hall 

•  Table  6.12,  Cochran  and  Cox  (1957),  Timm  (1980),  Reprinted  by  permission 
of  John  Wiley  and  Sons  and  Elsevier  North-Holland  Publishing  Company 

•  Table  6.14,  Timm  (1980),  Reprinted  by  permission  of  Elsevier  North-Holland 
Publishing  Company 

•  Table  6.16,  Potthoff  and  Roy  (1964),  Reprinted  by  permission  of  Biometrika 
Trustees 

• Table 6.17, Baten, Tack, and Baeder (1958), Reprinted by permission of Quality
Progress

• Table 6.18, Keuls et al. (1984), Reprinted by permission of Scientia
Horticulturae

•  Table  6.19,  Burdick  (1979),  Reprinted  by  permission  of  the  author 

•  Table  6.20,  Box  (1950),  Reprinted  by  permission  of  Biometrics 

•  Table  6.21,  Rao  (1948),  Reprinted  by  permission  of  Biometrika  Trustees 

•  Table  6.22,  Cameron  and  Pauling  (1978),  Reprinted  by  permission  of  National 
Academy  of  Science 

•  Table  6.23,  Williams  and  Izenman  (1989),  Reprinted  by  permission  of  Colorado 
State  University 

•  Table  6.24,  Beauchamp  and  Hoel  (1974),  Reprinted  by  permission  of  Journal 
of  Statistical  Computation  and  Simulation 

•  Table  6.25,  Box  (1950),  Reprinted  by  permission  of  Biometrics 

•  Table  6.26,  Grizzle  and  Allen  (1969),  Reprinted  by  permission  of  Biometrics 

•  Table  6.27,  Crepeau  et  al.  (1985),  Reprinted  by  permission  of  Biometrics 

•  Table  6.28,  Zerbe  (1979a),  Reprinted  by  permission  of  Journal  of  the  American 
Statistical  Association 

•  Table  6.29,  Timm  (1980),  Reprinted  by  permission  of  Elsevier  North-Holland 
Publishing  Company 

• Table 7.1, Siotani et al. (1963), Reprinted by permission of the Institute of
Statistical Mathematics


•  Table  7.2,  Reprinted  by  permission  of  R.  J.  Freund 

•  Table  8.1,  Kramer  and  Jensen  (1969a),  Reprinted  by  permission  of  Journal  of 
Quality  Technology 

•  Table  8.3,  Reprinted  by  permission  of  G.  R.  Bryce  and  R.  M.  Barker 

•  Table  10.1,  Box  and  Youle  (1955),  Reprinted  by  permission  of  Biometrics 

•  Tables  12.2,  12.3,  and  12.4,  Jeffers  (1967),  Reprinted  by  permission  of  Applied 
Statistics 

•  Table  13.1,  Brown  et  al.  (1984),  Reprinted  by  permission  of  the  Journal  of 
Pascal,  Ada,  and  Modula 

•  Correlation  matrix  in  Example  13.6,  Brown,  Strong,  and  Rencher  (1973), 
Reprinted  by  permission  of  The  Journal  of  the  Acoustical  Society  of  America 

•  Table  14.1,  Hartigan  (1975),  Reprinted  by  permission  of  John  Wiley  and  Sons 

• Table 14.3, Dawkins (1989), Reprinted by permission of The American
Statistician

• Table 14.7, Hand et al. (1994), Reprinted by permission of D. J. Hand

• Table 14.12, Sokol and Rohlf (1981), Reprinted by permission of W. H.
Freeman and Co.

• Table 14.13, Hand et al. (1994), Reprinted by permission of D. J. Hand

• Table 15.1, Kruskal and Wish (1978), Reprinted by permission of Sage
Publications

•  Tables  15.2  and  15.5,  Hand  et  al.  (1994),  Reprinted  by  permission  of  D.  J.  Hand 

•  Table  15.13,  Edwards  and  Kreiner  (1983),  Reprinted  by  permission  of  Biometrika 

•  Table  15.15,  Hand  et  al.  (1994),  Reprinted  by  permission  of  D.  J.  Hand 

•  Table  15.16,  Everitt  (1987),  Reprinted  by  permission  of  the  author 

•  Table  15.17,  Andrews  and  Herzberg  (1985),  Reprinted  by  permission  of 
Springer  Verlag 

•  Table  15.18,  Clausen  (1988),  Reprinted  by  permission  of  Sage  Publications 

•  Table  15.19,  Andrews  and  Herzberg  (1985),  Reprinted  by  permission  of 
Springer  Verlag 

• Table A.1, Mulholland (1977), Reprinted by permission of Biometrika Trustees

• Table A.2, D'Agostino and Pearson (1973), Reprinted by permission of
Biometrika Trustees

• Table A.3, D'Agostino and Tietjen (1971), Reprinted by permission of Biometrika
Trustees


•  Table  A.4,  D’Agostino  (1972),  Reprinted  by  permission  of  Biometrika  Trustees 

• Table A.5, Mardia (1970, 1974), Reprinted by permission of Biometrika
Trustees

• Table A.6, Barnett and Lewis (1978), Reprinted by permission of John Wiley
and Sons

• Table A.7, Kramer and Jensen (1969a), Reprinted by permission of Journal of
Quality Technology

• Table A.8, Bailey (1977), Reprinted by permission of Journal of the American
Statistical Association

• Table A.9, Wall (1967), Reprinted by permission of the author, Albuquerque,
NM

• Table A.10, Pearson and Hartley (1972) and Pillai (1964, 1965), Reprinted by
permission of Biometrika Trustees

• Table A.11, Schuurmann et al. (1975), Reprinted by permission of Journal of
Statistical Computation and Simulation

• Table A.12, Davis (1970a, b, 1980), Reprinted by permission of Biometrika
Trustees

• Table A.13, Kleinbaum, Kupper, and Muller (1988), Reprinted by permission
of PWS-KENT Publishing Company

• Table A.14, Lee et al. (1977), Reprinted by permission of Elsevier
North-Holland Publishing Company

• Table A.15, Mathai and Katiyar (1979), Reprinted by permission of Biometrika
Trustees


CHAPTER  1 


Introduction 


1.1  WHY  MULTIVARIATE  ANALYSIS? 

Multivariate analysis consists of a collection of methods that can be used when
several measurements are made on each individual or object in one or more samples. We
will refer to the measurements as variables and to the individuals or objects as units
(research units, sampling units, or experimental units) or observations. In practice,
multivariate data sets are common, although they are not always analyzed as such.
But the exclusive use of univariate procedures with such data is no longer excusable,
given the availability of multivariate techniques and inexpensive computing power
to carry them out.

Historically, the bulk of applications of multivariate techniques have been in the
behavioral and biological sciences. However, interest in multivariate methods has
now spread to numerous other fields of investigation. For example, I have
collaborated on multivariate problems with researchers in education, chemistry, physics,
geology, engineering, law, business, literature, religion, public broadcasting,
nursing, mining, linguistics, biology, psychology, and many other fields. Table 1.1 shows
some examples of multivariate observations.

The  reader  will  notice  that  in  some  cases  all  the  variables  are  measured  in  the  same 
scale  (see  1  and  2  in  Table  1.1).  In  other  cases,  measurements  are  in  different  scales 
(see  3  in  Table  1.1).  In  a  few  techniques,  such  as  profile  analysis  (Sections  5.9  and 
6.8),  the  variables  must  be  commensurate,  that  is,  similar  in  scale  of  measurement; 
however,  most  multivariate  methods  do  not  require  this. 

Ordinarily the variables are measured simultaneously on each sampling unit.
Typically, these variables are correlated. If this were not so, there would be little use for
many of the techniques of multivariate analysis. We need to untangle the overlapping
information provided by correlated variables and peer beneath the surface to see the
underlying structure. Thus the goal of many multivariate approaches is
simplification. We seek to express what is going on in terms of a reduced set of dimensions.
Such multivariate techniques are exploratory; they essentially generate hypotheses
rather than test them.

On the other hand, if our goal is a formal hypothesis test, we need a technique that
will (1) allow several variables to be tested and still preserve the significance level
and (2) do this for any intercorrelation structure of the variables. Many such tests are
available.

Table 1.1. Examples of Multivariate Data

      Units                       Variables
  1.  Students                    Several exam scores in a single course
  2.  Students                    Grades in mathematics, history, music, art, physics
  3.  People                      Height, weight, percentage of body fat, resting heart rate
  4.  Skulls                      Length, width, cranial capacity
  5.  Companies                   Expenditures for advertising, labor, raw materials
  6.  Manufactured items          Various measurements to check on compliance with specifications
  7.  Applicants for bank loans   Income, education level, length of residence, savings account, current debt load
  8.  Segments of literature      Sentence length, frequency of usage of certain words and of style characteristics
  9.  Human hairs                 Composition of various elements
 10.  Birds                       Lengths of various bones

As the two preceding paragraphs imply, multivariate analysis is concerned
generally with two areas, descriptive and inferential statistics. In the descriptive realm, we
often obtain optimal linear combinations of variables. The optimality criterion varies
from one technique to another, depending on the goal in each case. Although linear
combinations may seem too simple to reveal the underlying structure, we use them
for two obvious reasons: (1) they have mathematical tractability (linear
approximations are used throughout all science for the same reason) and (2) they often perform
well in practice. These linear functions may also be useful as a follow-up to
inferential procedures. When we have a statistically significant test result that compares
several groups, for example, we can find the linear combination (or combinations)
of variables that led to rejection of the hypothesis. Then the contribution of each
variable to these linear combinations is of interest.

In  the  inferential  area,  many  multivariate  techniques  are  extensions  of  univariate 
procedures.  In  such  cases,  we  review  the  univariate  procedure  before  presenting  the 
analogous  multivariate  approach. 

Multivariate inference is especially useful in curbing the researcher’s natural
tendency to read too much into the data. Total control is provided for experimentwise
error rate; that is, no matter how many variables are tested simultaneously, the value
of α (the significance level) remains at the level set by the researcher.
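As a rough illustration of why this control matters, the sketch below computes the chance of at least one false rejection when p separate univariate tests are each run at level α. It assumes the p tests are independent, a simplification that is not made in the text (correlated variables would give a different figure), and the function name `familywise_error_rate` is hypothetical.

```python
# Illustrative sketch only: assumes p INDEPENDENT univariate tests,
# each carried out at significance level alpha.
def familywise_error_rate(alpha: float, p: int) -> float:
    """Probability of at least one false rejection among p independent tests."""
    return 1.0 - (1.0 - alpha) ** p


# Show how the overall error rate grows with the number of variables tested.
for p in (1, 5, 10, 20):
    print(p, round(familywise_error_rate(0.05, p), 3))
```

Under this independence assumption, ten separate tests at α = 0.05 already give roughly a 40% chance of at least one spurious rejection, whereas a single multivariate test holds the overall rate at α.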

Some  authors  warn  against  applying  the  common  multivariate  techniques  to  data 
for  which  the  measurement  scale  is  not  interval  or  ratio.  It  has  been  found,  however, 
that  many  multivariate  techniques  give  reliable  results  when  applied  to  ordinal  data. 

For many years the applications lagged behind the theory because the
computations were beyond the power of the available desktop calculators. However, with
modern computers, virtually any analysis one desires, no matter how many variables
or observations are involved, can be quickly and easily carried out. Perhaps it is not
premature to say that multivariate analysis has come of age.


1.2  PREREQUISITES 

The  mathematical  prerequisite  for  reading  this  book  is  matrix  algebra.  Calculus  is  not 
used  [with  a  brief  exception  in  equation  (4.29)].  But  the  basic  tools  of  matrix  algebra 
are  essential,  and  the  presentation  in  Chapter  2  is  intended  to  be  sufficiently  complete 
so  that  the  reader  with  no  previous  experience  can  master  matrix  manipulation  up  to 
the  level  required  in  this  book. 

The  statistical  prerequisites  are  basic  familiarity  with  the  normal  distribution, 
t-tests, confidence intervals, multiple regression, and analysis of variance. These
techniques  are  reviewed  as  each  is  extended  to  the  analogous  multivariate  procedure. 

This  is  a  multivariate  methods  text.  Most  of  the  results  are  given  without  proof.  In 
a  few  cases  proofs  are  provided,  but  the  major  emphasis  is  on  heuristic  explanations. 
Our  goal  is  an  intuitive  grasp  of  multivariate  analysis,  in  the  same  mode  as  other 
statistical  methods  courses.  Some  problems  are  algebraic  in  nature,  but  the  majority 
involve  data  sets  to  be  analyzed. 


1.3  OBJECTIVES 

I  have  formulated  three  objectives  that  I  hope  this  book  will  achieve  for  the  reader. 
These  objectives  are  based  on  long  experience  teaching  a  course  in  multivariate 
methods,  consulting  on  multivariate  problems  with  researchers  in  many  fields,  and 
guiding  statistics  graduate  students  as  they  consulted  with  similar  clients. 

The  first  objective  is  to  gain  a  thorough  understanding  of  the  details  of  various 
multivariate  techniques,  their  purposes,  their  assumptions,  their  limitations,  and  so 
on.  Many  of  these  techniques  are  related;  yet  they  differ  in  some  essential  ways.  We 
emphasize  these  similarities  and  differences. 

The  second  objective  is  to  be  able  to  select  one  or  more  appropriate  techniques  for 
a  given  multivariate  data  set.  Recognizing  the  essential  nature  of  a  multivariate  data 
set  is  the  first  step  in  a  meaningful  analysis.  We  introduce  basic  types  of  multivariate 
data in Section 1.4.

The  third  objective  is  to  be  able  to  interpret  the  results  of  a  computer  analysis 
of  a  multivariate  data  set.  Reading  the  manual  for  a  particular  program  package  is 
not  enough  to  make  an  intelligent  appraisal  of  the  output.  Achievement  of  the  first 
objective  and  practice  on  data  sets  in  the  text  should  help  achieve  the  third  objective. 


1.4  BASIC  TYPES  OF  DATA  AND  ANALYSIS 

We will list four basic types of (continuous) multivariate data and then briefly
describe some possible analyses. Some writers would consider this an
oversimplification and might prefer elaborate tree diagrams of data structure. However, many
data sets can fit into one of these categories, and the simplicity of this structure
makes it easier to remember. The four basic data types are as follows:

1. A single sample with several variables measured on each sampling unit
(subject or object);

2.  A  single  sample  with  two  sets  of  variables  measured  on  each  unit; 

3.  Two  samples  with  several  variables  measured  on  each  unit; 

4.  Three  or  more  samples  with  several  variables  measured  on  each  unit. 

Each  data  type  has  extensions,  and  various  combinations  of  the  four  are  possible. 
A  few  examples  of  analyses  for  each  case  are  as  follows: 

1.  A  single  sample  with  several  variables  measured  on  each  sampling  unit: 

(a)  Test  the  hypothesis  that  the  means  of  the  variables  have  specified  values. 

(b)  Test  the  hypothesis  that  the  variables  are  uncorrelated  and  have  a  common 
variance. 

(c) Find a small set of linear combinations of the original variables that
summarizes most of the variation in the data (principal components).

(d) Express the original variables as linear functions of a smaller set of
underlying variables that account for the original variables and their
intercorrelations (factor analysis).

2.  A  single  sample  with  two  sets  of  variables  measured  on  each  unit: 

(a)  Determine  the  number,  the  size,  and  the  nature  of  relationships  between 
the  two  sets  of  variables  (canonical  correlation).  For  example,  you  may 
wish  to  relate  a  set  of  interest  variables  to  a  set  of  achievement  variables. 
How  much  overall  correlation  is  there  between  these  two  sets? 

(b)  Find  a  model  to  predict  one  set  of  variables  from  the  other  set  (multivariate 
multiple  regression). 

3.  Two  samples  with  several  variables  measured  on  each  unit: 

(a) Compare the means of the variables across the two samples (Hotelling’s
$T^2$-test).

(b) Find a linear combination of the variables that best separates the two
samples (discriminant analysis).

(c)  Find  a  function  of  the  variables  that  accurately  allocates  the  units  into  the 
two  groups  (classification  analysis). 

4.  Three  or  more  samples  with  several  variables  measured  on  each  unit: 

(a) Compare the means of the variables across the groups (multivariate
analysis of variance).

(b)  Extension  of  3(b)  to  more  than  two  groups. 

(c)  Extension  of  3(c)  to  more  than  two  groups. 


Matrix Algebra


2.1  INTRODUCTION 

This  chapter  introduces  the  basic  elements  of  matrix  algebra  used  in  the  remainder 
of  this  book.  It  is  essentially  a  review  of  the  requisite  matrix  tools  and  is  not  intended 
to  be  a  complete  development.  However,  it  is  sufficiently  self-contained  so  that  those 
with  no  previous  exposure  to  the  subject  should  need  no  other  reference.  Anyone 
unfamiliar  with  matrix  algebra  should  plan  to  work  most  of  the  problems  entailing 
numerical  illustrations.  It  would  also  be  helpful  to  explore  some  of  the  problems 
involving  general  matrix  manipulation. 

With  the  exception  of  a  few  derivations  that  seemed  instructive,  most  of  the  results 
are  given  without  proof.  Some  additional  proofs  are  requested  in  the  problems.  For 
the  remaining  proofs,  see  any  general  text  on  matrix  theory  or  one  of  the  specialized 
matrix  texts  oriented  to  statistics,  such  as  Graybill  (1969),  Searle  (1982),  or  Harville 
(1997). 


2.2  NOTATION  AND  BASIC  DEFINITIONS 
2.2.1  Matrices,  Vectors,  and  Scalars 

A  matrix  is  a  rectangular  or  square  array  of  numbers  or  variables  arranged  in  rows 
and  columns.  We  use  uppercase  boldface  letters  to  represent  matrices.  All  entries  in 
matrices  will  be  real  numbers  or  variables  representing  real  numbers.  The  elements 
of  a  matrix  are  displayed  in  brackets.  For  example,  the  ACT  score  and  GPA  for  three 
students  can  be  conveniently  listed  in  the  following  matrix: 


$$\mathbf{A} = \begin{pmatrix} 23 & 3.54 \\ 29 & 3.81 \\ 18 & 2.75 \end{pmatrix}. \tag{2.1}$$


The  elements  of  A  can  also  be  variables,  representing  possible  values  of  ACT  and 
GPA  for  three  students: 
$$\mathbf{A} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \\ a_{31} & a_{32} \end{pmatrix}. \tag{2.2}$$

In this double-subscript notation for the elements of a matrix, the first subscript indicates the row; the second identifies the column. The matrix $\mathbf{A}$ in (2.2) can also be expressed as

$$\mathbf{A} = (a_{ij}), \tag{2.3}$$

where $a_{ij}$ is a general element.

With  three  rows  and  two  columns,  the  matrix  A  in  (2.1)  or  (2.2)  is  said  to  be 
3  x  2.  In  general,  if  a  matrix  A  has  n  rows  and  p  columns,  it  is  said  to  be  n  x  p. 
Alternatively,  we  say  the  size  of  A  is  n  x  p. 

A  vector  is  a  matrix  with  a  single  column  or  row.  The  following  could  be  the  test 
scores  of  a  student  in  a  course  in  multivariate  analysis: 


$$\mathbf{x} = \begin{pmatrix} 98 \\ 86 \\ 93 \\ 97 \end{pmatrix}. \tag{2.4}$$


Variable  elements  in  a  vector  can  be  identified  by  a  single  subscript: 

$$\mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix}. \tag{2.5}$$

We  use  lowercase  boldface  letters  for  column  vectors.  Row  vectors  are  expressed  as 


$$\mathbf{x}' = (x_1, x_2, x_3, x_4) \quad \text{or as} \quad \mathbf{x}' = (x_1 \;\; x_2 \;\; x_3 \;\; x_4),$$


where $\mathbf{x}'$ indicates the transpose of $\mathbf{x}$. The transpose operation is defined in Section 2.2.3.

Geometrically, a vector with $p$ elements identifies a point in a $p$-dimensional space. The elements in the vector are the coordinates of the point. In (2.35) in Section 2.3.3, we define the distance from the origin to the point. In Section 3.12, we define the distance between two vectors. In some cases, we will be interested in a directed line segment or arrow from the origin to the point.

A  single  real  number  is  called  a  scalar,  to  distinguish  it  from  a  vector  or  matrix. 
Thus 2, $-4$, and 125 are scalars. A variable representing a scalar is usually denoted by a lowercase nonbolded letter, such as $a = 5$. A product involving vectors and matrices may reduce to a matrix of size $1 \times 1$, which then becomes a scalar.
2.2.2 Equality of Vectors and Matrices


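Two matrices are equal if they are the same size and the elements in corresponding positions are equal. Thus if $\mathbf{A} = (a_{ij})$ and $\mathbf{B} = (b_{ij})$, then $\mathbf{A} = \mathbf{B}$ if $a_{ij} = b_{ij}$ for all $i$ and $j$. For example, let

$$\mathbf{A} = \begin{pmatrix} 3 & -2 & 4 \\ 1 & 3 & 7 \end{pmatrix}, \quad \mathbf{B} = \begin{pmatrix} 3 & 1 \\ -2 & 3 \\ 4 & 7 \end{pmatrix}, \quad \mathbf{C} = \begin{pmatrix} 3 & -2 & 4 \\ 1 & 3 & 7 \end{pmatrix}, \quad \mathbf{D} = \begin{pmatrix} 3 & -2 & 4 \\ 1 & 3 & 6 \end{pmatrix}.$$

Then $\mathbf{A} = \mathbf{C}$. But even though $\mathbf{A}$ and $\mathbf{B}$ have the same elements, $\mathbf{A} \ne \mathbf{B}$ because the two matrices are not the same size. Likewise, $\mathbf{A} \ne \mathbf{D}$ because $a_{23} \ne d_{23}$. Thus two matrices of the same size are unequal if they differ in a single position.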


2.2.3  Transpose  and  Symmetric  Matrices 

The  transpose  of  a  matrix  A,  denoted  by  A',  is  obtained  from  A  by  interchanging 
rows  and  columns.  Thus  the  columns  of  A'  are  the  rows  of  A,  and  the  rows  of  A' 
are  the  columns  of  A.  The  following  examples  illustrate  the  transpose  of  a  matrix  or 
vector: 


The  transpose  operation  does  not  change  a  scalar,  since  it  has  only  one  row  and 
one  column. 

If  the  transpose  operator  is  applied  twice  to  any  matrix,  the  result  is  the  original 
matrix: 


(A')'  =  A.  (2.6) 

If  the  transpose  of  a  matrix  is  the  same  as  the  original  matrix,  the  matrix  is  said  to 
be  symmetric,  that  is,  A  is  symmetric  if  A  =  A'.  For  example, 

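$$\mathbf{A} = \begin{pmatrix} 3 & -2 & 4 \\ -2 & 10 & -7 \\ 4 & -7 & 9 \end{pmatrix}, \quad \mathbf{A}' = \begin{pmatrix} 3 & -2 & 4 \\ -2 & 10 & -7 \\ 4 & -7 & 9 \end{pmatrix}.$$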


Clearly,  all  symmetric  matrices  are  square. 
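As a small illustration of the transpose and of the symmetry check, here is a minimal Python sketch. It assumes matrices are stored as lists of rows; the helper names `transpose` and `is_symmetric` are hypothetical and are not part of the text.

```python
def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def is_symmetric(m):
    """A matrix is symmetric when it equals its own transpose."""
    return m == transpose(m)

A = [[3, -2, 4],
     [-2, 10, -7],
     [4, -7, 9]]           # the symmetric matrix shown above
R = [[23, 3.54],
     [29, 3.81],
     [18, 2.75]]           # the 3 x 2 matrix of (2.1)

print(transpose(R))                    # a 2 x 3 matrix: rows and columns interchanged
print(transpose(transpose(R)) == R)    # checks (A')' = A, as in (2.6)
print(is_symmetric(A), is_symmetric(R))
```

Because a rectangular matrix and its transpose differ in size, the comparison inside `is_symmetric` fails for `R`, which mirrors the remark that all symmetric matrices are square.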
2.2.4 Special Matrices

The diagonal of a $p \times p$ square matrix $\mathbf{A}$ consists of the elements $a_{11}, a_{22}, \ldots, a_{pp}$. For example, in the matrix

$$\mathbf{A} = \begin{pmatrix} 5 & -2 & 4 \\ 7 & 9 & 3 \\ -6 & 8 & 1 \end{pmatrix},$$

the elements 5, 9, and 1 lie on the diagonal. If a matrix contains zeros in all off-diagonal positions, it is said to be a diagonal matrix. An example of a diagonal matrix is

$$\mathbf{D} = \begin{pmatrix} 10 & 0 & 0 & 0 \\ 0 & -3 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 7 \end{pmatrix}.$$


This  matrix  can  also  be  denoted  as 


$$\mathbf{D} = \operatorname{diag}(10, -3, 0, 7). \tag{2.7}$$

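A diagonal matrix can be formed from any square matrix by replacing off-diagonal elements by 0's. This is denoted by $\operatorname{diag}(\mathbf{A})$. Thus for the preceding matrix $\mathbf{A}$, we have

$$\operatorname{diag}(\mathbf{A}) = \operatorname{diag}\begin{pmatrix} 5 & -2 & 4 \\ 7 & 9 & 3 \\ -6 & 8 & 1 \end{pmatrix} = \begin{pmatrix} 5 & 0 & 0 \\ 0 & 9 & 0 \\ 0 & 0 & 1 \end{pmatrix}. \tag{2.8}$$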

A  diagonal  matrix  with  a  1  in  each  diagonal  position  is  called  an  identity  matrix 
and  is  denoted  by  I.  For  example,  a  3  x  3  identity  matrix  is  given  by 

$$\mathbf{I} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}. \tag{2.9}$$


An  upper  triangular  matrix  is  a  square  matrix  with  zeros  below  the  diagonal,  such 
as 


$$\mathbf{T} = \begin{pmatrix} 8 & 3 & 4 & 7 \\ 0 & 0 & -2 & 3 \\ 0 & 0 & 5 & 1 \\ 0 & 0 & 0 & 6 \end{pmatrix}. \tag{2.10}$$


A  lower  triangular  matrix  is  defined  similarly. 
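A vector of 1's is denoted by $\mathbf{j}$:

$$\mathbf{j} = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}. \tag{2.11}$$

A square matrix of 1's is denoted by $\mathbf{J}$; for example, in the $3 \times 3$ case,

$$\mathbf{J} = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix}. \tag{2.12}$$

Finally, we denote a vector of zeros by $\mathbf{0}$ and a matrix of zeros by $\mathbf{O}$. For example,

$$\mathbf{0} = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \quad \mathbf{O} = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}. \tag{2.13}$$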

2.3  OPERATIONS 

2.3.1  Summation  and  Product  Notation 

For completeness, we review the standard mathematical notation for sums and products. The sum of a sequence of numbers $a_1, a_2, \ldots, a_n$ is indicated by

$$\sum_{i=1}^{n} a_i = a_1 + a_2 + \cdots + a_n.$$

If the $n$ numbers are all the same, then $\sum_{i=1}^{n} a = a + a + \cdots + a = na$. The sum of all the numbers in an array with double subscripts, such as

$$\begin{matrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23}, \end{matrix}$$

is indicated by

$$\sum_{i=1}^{2} \sum_{j=1}^{3} a_{ij} = a_{11} + a_{12} + a_{13} + a_{21} + a_{22} + a_{23}.$$

This is sometimes abbreviated to

$$\sum_{i=1}^{2} \sum_{j=1}^{3} a_{ij} = \sum_{ij} a_{ij}.$$
The product of a sequence of numbers $a_1, a_2, \ldots, a_n$ is indicated by

$$\prod_{i=1}^{n} a_i = (a_1)(a_2) \cdots (a_n).$$

If the $n$ numbers are all equal, the product becomes $\prod_{i=1}^{n} a = (a)(a) \cdots (a) = a^n$.

2.3.2  Addition  of  Matrices  and  Vectors 

If two matrices (or two vectors) are the same size, their sum is found by adding corresponding elements; that is, if $\mathbf{A}$ is $n \times p$ and $\mathbf{B}$ is $n \times p$, then $\mathbf{C} = \mathbf{A} + \mathbf{B}$ is also $n \times p$ and is found as $(c_{ij}) = (a_{ij} + b_{ij})$. For example,


Similarly,  the  difference  between  two  matrices  or  two  vectors  of  the  same  size  is 
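found by subtracting corresponding elements. Thus $\mathbf{C} = \mathbf{A} - \mathbf{B}$ is found as $(c_{ij}) = (a_{ij} - b_{ij})$. For example,

$$(3 \;\; 9 \;\; -4) - (5 \;\; -4 \;\; 2) = (-2 \;\; 13 \;\; -6).$$

If two matrices are identical, their difference is a zero matrix; that is, $\mathbf{A} = \mathbf{B}$ implies $\mathbf{A} - \mathbf{B} = \mathbf{O}$. For example,

$$\begin{pmatrix} 3 & -2 & 4 \\ 6 & 7 & 5 \end{pmatrix} - \begin{pmatrix} 3 & -2 & 4 \\ 6 & 7 & 5 \end{pmatrix} = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}.$$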

Matrix  addition  is  commutative: 

A  +  B  =  B  +  A.  (2.14) 

The  transpose  of  the  sum  (difference)  of  two  matrices  is  the  sum  (difference)  of 
the  transposes: 


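$$(\mathbf{A} + \mathbf{B})' = \mathbf{A}' + \mathbf{B}', \tag{2.15}$$

$$(\mathbf{A} - \mathbf{B})' = \mathbf{A}' - \mathbf{B}', \tag{2.16}$$

$$(\mathbf{x} + \mathbf{y})' = \mathbf{x}' + \mathbf{y}', \tag{2.17}$$

$$(\mathbf{x} - \mathbf{y})' = \mathbf{x}' - \mathbf{y}'. \tag{2.18}$$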
2.3.3 Multiplication of Matrices and Vectors

In order for the product $\mathbf{AB}$ to be defined, the number of columns in $\mathbf{A}$ must be the same as the number of rows in $\mathbf{B}$, in which case $\mathbf{A}$ and $\mathbf{B}$ are said to be conformable. Then the $(ij)$th element of $\mathbf{C} = \mathbf{AB}$ is

$$c_{ij} = \sum_{k} a_{ik} b_{kj}. \tag{2.19}$$

Thus $c_{ij}$ is the sum of products of the $i$th row of $\mathbf{A}$ and the $j$th column of $\mathbf{B}$. We therefore multiply each row of $\mathbf{A}$ by each column of $\mathbf{B}$, and the size of $\mathbf{AB}$ consists of the number of rows of $\mathbf{A}$ and the number of columns of $\mathbf{B}$. Thus, if $\mathbf{A}$ is $n \times m$ and $\mathbf{B}$ is $m \times p$, then $\mathbf{C} = \mathbf{AB}$ is $n \times p$. For example, if


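$$\mathbf{A} = \begin{pmatrix} 2 & 1 & 3 \\ 4 & 6 & 5 \\ 7 & 2 & 3 \\ 1 & 3 & 2 \end{pmatrix} \quad \text{and} \quad \mathbf{B} = \begin{pmatrix} 1 & 4 \\ 2 & 6 \\ 3 & 8 \end{pmatrix},$$

then

$$\mathbf{C} = \mathbf{AB} = \begin{pmatrix} 2 \cdot 1 + 1 \cdot 2 + 3 \cdot 3 & 2 \cdot 4 + 1 \cdot 6 + 3 \cdot 8 \\ 4 \cdot 1 + 6 \cdot 2 + 5 \cdot 3 & 4 \cdot 4 + 6 \cdot 6 + 5 \cdot 8 \\ 7 \cdot 1 + 2 \cdot 2 + 3 \cdot 3 & 7 \cdot 4 + 2 \cdot 6 + 3 \cdot 8 \\ 1 \cdot 1 + 3 \cdot 2 + 2 \cdot 3 & 1 \cdot 4 + 3 \cdot 6 + 2 \cdot 8 \end{pmatrix} = \begin{pmatrix} 13 & 38 \\ 31 & 92 \\ 20 & 64 \\ 13 & 38 \end{pmatrix}.$$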

Note  that  A  is  4  x  3,  B  is  3  x  2,  and  AB  is  4  x  2.  In  this  case,  AB  is  of  a  different 
size  than  either  A  or  B. 

If $\mathbf{A}$ and $\mathbf{B}$ are both $n \times n$, then $\mathbf{AB}$ is also $n \times n$. Clearly, $\mathbf{A}^2$ is defined only if $\mathbf{A}$ is a square matrix.

In  some  cases  AB  is  defined,  but  BA  is  not  defined.  In  the  preceding  example,  BA 
cannot  be  found  because  B  is  3  x  2  and  A  is  4  x  3  and  a  row  of  B  cannot  be  multiplied 
by  a  column  of  A.  Sometimes  AB  and  BA  are  both  defined  but  are  different  in  size. 
For  example,  if  A  is  2  x  4  and  B  is  4  x  2,  then  AB  is  2  x  2  and  BA  is  4  x  4.  If  A  and 
B  are  square  and  the  same  size,  then  AB  and  BA  are  both  defined.  However, 


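$$\mathbf{AB} \ne \mathbf{BA}, \tag{2.20}$$

except for a few special cases. For example, let

$$\mathbf{A} = \begin{pmatrix} 1 & 3 \\ 2 & 4 \end{pmatrix}, \quad \mathbf{B} = \begin{pmatrix} 1 & -2 \\ 3 & 5 \end{pmatrix}.$$

Then

$$\mathbf{AB} = \begin{pmatrix} 10 & 13 \\ 14 & 16 \end{pmatrix}, \quad \mathbf{BA} = \begin{pmatrix} -3 & -5 \\ 13 & 29 \end{pmatrix}.$$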

Thus  we  must  be  careful  to  specify  the  order  of  multiplication.  If  we  wish  to  multiply 
both  sides  of  a  matrix  equation  by  a  matrix,  we  must  multiply  on  the  left  or  on  the 
right  and  be  consistent  on  both  sides  of  the  equation. 
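The row-by-column rule in (2.19) and the warning about the order of multiplication can be sketched in a few lines of Python. This is only an illustration under stated assumptions: matrices are lists of rows, `matmul` is a hypothetical helper name, and the square matrices `P` and `Q` are values chosen here for the demonstration.

```python
def matmul(a, b):
    """Product of conformable matrices (lists of rows), following (2.19)."""
    if len(a[0]) != len(b):
        raise ValueError("not conformable: columns of A must equal rows of B")
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

A = [[2, 1, 3], [4, 6, 5], [7, 2, 3], [1, 3, 2]]   # 4 x 3
B = [[1, 4], [2, 6], [3, 8]]                       # 3 x 2
print(matmul(A, B))        # the 4 x 2 product AB

P = [[1, 3], [2, 4]]       # assumed illustrative values
Q = [[1, -2], [3, 5]]
print(matmul(P, Q))        # PQ
print(matmul(Q, P))        # QP, which differs from PQ
```

Calling `matmul(B, A)` raises an error, since a 3 x 2 matrix cannot be multiplied on the right by a 4 x 3 matrix.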

Multiplication  is  distributive  over  addition  or  subtraction: 


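$$\mathbf{A}(\mathbf{B} + \mathbf{C}) = \mathbf{AB} + \mathbf{AC}, \tag{2.21}$$

$$\mathbf{A}(\mathbf{B} - \mathbf{C}) = \mathbf{AB} - \mathbf{AC}, \tag{2.22}$$

$$(\mathbf{A} + \mathbf{B})\mathbf{C} = \mathbf{AC} + \mathbf{BC}, \tag{2.23}$$

$$(\mathbf{A} - \mathbf{B})\mathbf{C} = \mathbf{AC} - \mathbf{BC}. \tag{2.24}$$

Note that, in general, because of (2.20),

$$\mathbf{A}(\mathbf{B} + \mathbf{C}) \ne \mathbf{BA} + \mathbf{CA}. \tag{2.25}$$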

Using  the  distributive  law,  we  can  expand  products  such  as  (A  —  B)(C  —  D)  to 
obtain 

(A  -  B)(C  -  D)  =  (A  -  B)C  -  (A  -  B)D  [by  (2.22)] 

=  AC  -  BC  -  AD  +  BD  [by  (2.24)] .  (2.26) 

The  transpose  of  a  product  is  the  product  of  the  transposes  in  reverse  order: 

(AB)'  =  B'A'.  (2.27) 

Note  that  (2.27)  holds  as  long  as  A  and  B  are  conformable.  They  need  not  be  square. 
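A quick numerical check of (2.27) for rectangular matrices, as a sketch only. The helpers `transpose` and `matmul` are hypothetical names, and the matrix `B` below is an assumed example rather than one from the text.

```python
def transpose(m):
    """Interchange rows and columns of a list-of-rows matrix."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Row-by-column product of conformable list-of-rows matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

A = [[3, -2, 4], [1, 3, 5]]          # 2 x 3
B = [[1, 0], [2, -1], [0, 4]]        # 3 x 2, assumed for illustration

lhs = transpose(matmul(A, B))               # (AB)'
rhs = matmul(transpose(B), transpose(A))    # B'A'
print(lhs == rhs)
```

Note that the product `matmul(transpose(A), transpose(B))` is 3 x 3 here, so it cannot equal the 2 x 2 matrix `lhs`; the reversal of order is essential.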

Multiplication  involving  vectors  follows  the  same  rules  as  for  matrices.  Suppose 
A  is  n  x  p,  a  is  p  x  1,  b  is  p  x  1,  and  c  is  n  x  1.  Then  some  possible  products  are 
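$\mathbf{Ab}$, $\mathbf{c}'\mathbf{A}$, $\mathbf{a}'\mathbf{b}$, $\mathbf{b}'\mathbf{a}$, and $\mathbf{ab}'$. For example, let

$$\mathbf{A} = \begin{pmatrix} 3 & -2 & 4 \\ 1 & 3 & 5 \end{pmatrix}, \quad \mathbf{a} = \begin{pmatrix} 1 \\ -2 \\ 3 \end{pmatrix}, \quad \mathbf{b} = \begin{pmatrix} 2 \\ 3 \\ 4 \end{pmatrix}, \quad \mathbf{c} = \begin{pmatrix} 2 \\ -5 \end{pmatrix}.$$

Then

$$\mathbf{Ab} = \begin{pmatrix} 3 & -2 & 4 \\ 1 & 3 & 5 \end{pmatrix} \begin{pmatrix} 2 \\ 3 \\ 4 \end{pmatrix} = \begin{pmatrix} 16 \\ 31 \end{pmatrix},$$

$$\mathbf{c}'\mathbf{A} = (2 \;\; -5) \begin{pmatrix} 3 & -2 & 4 \\ 1 & 3 & 5 \end{pmatrix} = (1 \;\; -19 \;\; -17),$$

$$\mathbf{c}'\mathbf{Ab} = (2 \;\; -5) \begin{pmatrix} 3 & -2 & 4 \\ 1 & 3 & 5 \end{pmatrix} \begin{pmatrix} 2 \\ 3 \\ 4 \end{pmatrix} = (2 \;\; -5) \begin{pmatrix} 16 \\ 31 \end{pmatrix} = -123,$$

$$\mathbf{a}'\mathbf{b} = (1 \;\; -2 \;\; 3) \begin{pmatrix} 2 \\ 3 \\ 4 \end{pmatrix} = 8,$$

$$\mathbf{b}'\mathbf{a} = (2 \;\; 3 \;\; 4) \begin{pmatrix} 1 \\ -2 \\ 3 \end{pmatrix} = 8.$$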


Note  that  Ab  is  a  column  vector,  c'A  is  a  row  vector,  c'Ab  is  a  scalar,  and  a'b  =  b'a. 
The  triple  product  c'Ab  was  obtained  as  c'(Ab).  The  same  result  would  be  obtained 
if  we  multiplied  in  the  order  (c'A)b: 


$$(\mathbf{c}'\mathbf{A})\mathbf{b} = (1 \;\; -19 \;\; -17) \begin{pmatrix} 2 \\ 3 \\ 4 \end{pmatrix} = -123.$$


This  is  true  in  general  for  a  triple  product: 

ABC  =  A(BC)  =  (AB)C.  (2.28) 

Thus  multiplication  of  three  matrices  can  be  defined  in  terms  of  the  product  of  two 
matrices,  since  (fortunately)  it  does  not  matter  which  two  are  multiplied  first.  Note 
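that $\mathbf{A}$ and $\mathbf{B}$ must be conformable for multiplication, and $\mathbf{B}$ and $\mathbf{C}$ must be conformable. For example, if $\mathbf{A}$ is $n \times p$, $\mathbf{B}$ is $p \times q$, and $\mathbf{C}$ is $q \times m$, then both multiplications are possible and the product $\mathbf{ABC}$ is $n \times m$.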

We  can  sometimes  factor  a  sum  of  triple  products  on  both  the  right  and  left  sides. 
For  example, 


ABC  +  ADC  =  A(B  +  D)C.  (2.29) 

As  another  illustration,  let  X  be  n  x  p  and  A  be  n  x  n .  Then 


X'X  -  X'AX  =  X'(X  -  AX)  =  X'(I  -  A)X. 


(2.30) 
If $\mathbf{a}$ and $\mathbf{b}$ are both $n \times 1$, then

$$\mathbf{a}'\mathbf{b} = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n \tag{2.31}$$


is  a  sum  of  products  and  is  a  scalar.  On  the  other  hand,  ab'  is  defined  for  any  size  a 
and  b  and  is  a  matrix,  either  rectangular  or  square: 


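$$\mathbf{ab}' = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{pmatrix} (b_1 \;\; b_2 \;\; \cdots \;\; b_p) = \begin{pmatrix} a_1 b_1 & a_1 b_2 & \cdots & a_1 b_p \\ a_2 b_1 & a_2 b_2 & \cdots & a_2 b_p \\ \vdots & \vdots & & \vdots \\ a_n b_1 & a_n b_2 & \cdots & a_n b_p \end{pmatrix}. \tag{2.32}$$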

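Similarly,

$$\mathbf{a}'\mathbf{a} = a_1^2 + a_2^2 + \cdots + a_n^2, \tag{2.33}$$

$$\mathbf{aa}' = \begin{pmatrix} a_1^2 & a_1 a_2 & \cdots & a_1 a_n \\ a_2 a_1 & a_2^2 & \cdots & a_2 a_n \\ \vdots & \vdots & & \vdots \\ a_n a_1 & a_n a_2 & \cdots & a_n^2 \end{pmatrix}. \tag{2.34}$$

Thus $\mathbf{a}'\mathbf{a}$ is a sum of squares, and $\mathbf{aa}'$ is a square (symmetric) matrix. The products $\mathbf{a}'\mathbf{a}$ and $\mathbf{aa}'$ are sometimes referred to as the dot product and matrix product, respectively. The square root of the sum of squares of the elements of $\mathbf{a}$ is the distance from the origin to the point $\mathbf{a}$ and is also referred to as the length of $\mathbf{a}$:

$$\text{Length of } \mathbf{a} = \sqrt{\mathbf{a}'\mathbf{a}} = \sqrt{\sum_{i=1}^{n} a_i^2}. \tag{2.35}$$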


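As special cases of (2.33) and (2.34), note that if $\mathbf{j}$ is $n \times 1$, then

$$\mathbf{j}'\mathbf{j} = n, \quad \mathbf{jj}' = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ 1 & 1 & \cdots & 1 \\ \vdots & \vdots & & \vdots \\ 1 & 1 & \cdots & 1 \end{pmatrix} = \mathbf{J}, \tag{2.36}$$

where $\mathbf{j}$ and $\mathbf{J}$ were defined in (2.11) and (2.12). If $\mathbf{a}$ is $n \times 1$ and $\mathbf{A}$ is $n \times p$, then

$$\mathbf{a}'\mathbf{j} = \mathbf{j}'\mathbf{a} = \sum_{i=1}^{n} a_i, \tag{2.37}$$

$$\mathbf{j}'\mathbf{A} = \Bigl( \sum_i a_{i1}, \; \sum_i a_{i2}, \; \ldots, \; \sum_i a_{ip} \Bigr), \quad \mathbf{Aj} = \begin{pmatrix} \sum_j a_{1j} \\ \sum_j a_{2j} \\ \vdots \\ \sum_j a_{nj} \end{pmatrix}. \tag{2.38}$$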


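Thus $\mathbf{a}'\mathbf{j}$ is the sum of the elements in $\mathbf{a}$, $\mathbf{j}'\mathbf{A}$ contains the column sums of $\mathbf{A}$, and $\mathbf{Aj}$ contains the row sums of $\mathbf{A}$. In $\mathbf{a}'\mathbf{j}$, the vector $\mathbf{j}$ is $n \times 1$; in $\mathbf{j}'\mathbf{A}$, the vector $\mathbf{j}$ is $n \times 1$; and in $\mathbf{Aj}$, the vector $\mathbf{j}$ is $p \times 1$.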
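The role of $\mathbf{j}$ as a summing vector can be illustrated with a short Python sketch. The helper names are hypothetical, vectors are stored as single-column matrices, and the 4 x 3 matrix is reused from the earlier multiplication example purely for illustration.

```python
def transpose(m):
    """Interchange rows and columns of a list-of-rows matrix."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Row-by-column product of conformable list-of-rows matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

A = [[2, 1, 3],
     [4, 6, 5],
     [7, 2, 3],
     [1, 3, 2]]                        # n x p with n = 4, p = 3
n, p = len(A), len(A[0])
j_n = [[1] for _ in range(n)]          # n x 1 vector of 1's
j_p = [[1] for _ in range(p)]          # p x 1 vector of 1's

col_sums = matmul(transpose(j_n), A)   # j'A: a 1 x p row of column sums
row_sums = matmul(A, j_p)              # Aj: an n x 1 column of row sums
count = matmul(transpose(j_n), j_n)    # j'j = n, returned as a 1 x 1 matrix
J = matmul(j_n, transpose(j_n))        # jj' = J, an n x n matrix of 1's
print(col_sums, row_sums, count)
```

The two products use vectors of different lengths, which is the point made in the text about the size of $\mathbf{j}$ in each expression.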

Since  a'b  is  a  scalar,  it  is  equal  to  its  transpose: 


a'b  =  (a'b)'  =  b'(a')'  =  b'a. 


(2.39) 


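This allows us to write $(\mathbf{a}'\mathbf{b})^2$ in the form

$$(\mathbf{a}'\mathbf{b})^2 = (\mathbf{a}'\mathbf{b})(\mathbf{a}'\mathbf{b}) = (\mathbf{a}'\mathbf{b})(\mathbf{b}'\mathbf{a}) = \mathbf{a}'(\mathbf{bb}')\mathbf{a}. \tag{2.40}$$

From (2.18), (2.26), and (2.39) we obtain

$$(\mathbf{x} - \mathbf{y})'(\mathbf{x} - \mathbf{y}) = \mathbf{x}'\mathbf{x} - 2\mathbf{x}'\mathbf{y} + \mathbf{y}'\mathbf{y}. \tag{2.41}$$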

Note  that  in  analogous  expressions  with  matrices,  however,  the  two  middle  terms 
cannot  be  combined: 


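$$(\mathbf{A} - \mathbf{B})'(\mathbf{A} - \mathbf{B}) = \mathbf{A}'\mathbf{A} - \mathbf{A}'\mathbf{B} - \mathbf{B}'\mathbf{A} + \mathbf{B}'\mathbf{B},$$

$$(\mathbf{A} - \mathbf{B})^2 = (\mathbf{A} - \mathbf{B})(\mathbf{A} - \mathbf{B}) = \mathbf{A}^2 - \mathbf{AB} - \mathbf{BA} + \mathbf{B}^2.$$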


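If $\mathbf{a}$ and $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$ are all $p \times 1$ and $\mathbf{A}$ is $p \times p$, we obtain the following factoring results as extensions of (2.21) and (2.29):

$$\sum_{i=1}^{n} \mathbf{a}'\mathbf{x}_i = \mathbf{a}' \sum_{i=1}^{n} \mathbf{x}_i, \tag{2.42}$$

$$\sum_{i=1}^{n} \mathbf{A}\mathbf{x}_i = \mathbf{A} \sum_{i=1}^{n} \mathbf{x}_i, \tag{2.43}$$

$$\sum_{i=1}^{n} (\mathbf{a}'\mathbf{x}_i)^2 = \mathbf{a}' \Bigl( \sum_{i=1}^{n} \mathbf{x}_i \mathbf{x}_i' \Bigr) \mathbf{a} \quad [\text{by } (2.40)], \tag{2.44}$$

$$\sum_{i=1}^{n} \mathbf{A}\mathbf{x}_i (\mathbf{A}\mathbf{x}_i)' = \mathbf{A} \Bigl( \sum_{i=1}^{n} \mathbf{x}_i \mathbf{x}_i' \Bigr) \mathbf{A}'. \tag{2.45}$$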

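We can express matrix multiplication in terms of row vectors and column vectors. If $\mathbf{a}_i'$ is the $i$th row of $\mathbf{A}$ and $\mathbf{b}_j$ is the $j$th column of $\mathbf{B}$, then the $(ij)$th element of $\mathbf{AB}$ is $\mathbf{a}_i'\mathbf{b}_j$. For example, if $\mathbf{A}$ has three rows and $\mathbf{B}$ has two columns,

$$\mathbf{A} = \begin{pmatrix} \mathbf{a}_1' \\ \mathbf{a}_2' \\ \mathbf{a}_3' \end{pmatrix}, \quad \mathbf{B} = (\mathbf{b}_1, \mathbf{b}_2),$$

then the product $\mathbf{AB}$ can be written as

$$\mathbf{AB} = \begin{pmatrix} \mathbf{a}_1'\mathbf{b}_1 & \mathbf{a}_1'\mathbf{b}_2 \\ \mathbf{a}_2'\mathbf{b}_1 & \mathbf{a}_2'\mathbf{b}_2 \\ \mathbf{a}_3'\mathbf{b}_1 & \mathbf{a}_3'\mathbf{b}_2 \end{pmatrix}. \tag{2.46}$$

This can be expressed in terms of the rows of $\mathbf{A}$:

$$\mathbf{AB} = \begin{pmatrix} \mathbf{a}_1'(\mathbf{b}_1, \mathbf{b}_2) \\ \mathbf{a}_2'(\mathbf{b}_1, \mathbf{b}_2) \\ \mathbf{a}_3'(\mathbf{b}_1, \mathbf{b}_2) \end{pmatrix} = \begin{pmatrix} \mathbf{a}_1'\mathbf{B} \\ \mathbf{a}_2'\mathbf{B} \\ \mathbf{a}_3'\mathbf{B} \end{pmatrix} = \begin{pmatrix} \mathbf{a}_1' \\ \mathbf{a}_2' \\ \mathbf{a}_3' \end{pmatrix} \mathbf{B}.$$

Note that the first column of $\mathbf{AB}$ in (2.46) is

$$\begin{pmatrix} \mathbf{a}_1'\mathbf{b}_1 \\ \mathbf{a}_2'\mathbf{b}_1 \\ \mathbf{a}_3'\mathbf{b}_1 \end{pmatrix} = \begin{pmatrix} \mathbf{a}_1' \\ \mathbf{a}_2' \\ \mathbf{a}_3' \end{pmatrix} \mathbf{b}_1 = \mathbf{A}\mathbf{b}_1,$$

and likewise the second column is $\mathbf{A}\mathbf{b}_2$. Thus $\mathbf{AB}$ can be written in the form

$$\mathbf{AB} = \mathbf{A}(\mathbf{b}_1, \mathbf{b}_2) = (\mathbf{A}\mathbf{b}_1, \mathbf{A}\mathbf{b}_2). \tag{2.47}$$

This result holds in general:

$$\mathbf{AB} = \mathbf{A}(\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_p) = (\mathbf{A}\mathbf{b}_1, \mathbf{A}\mathbf{b}_2, \ldots, \mathbf{A}\mathbf{b}_p). \tag{2.48}$$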

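To further illustrate matrix multiplication in terms of rows and columns, let $\mathbf{A} = \begin{pmatrix} \mathbf{a}_1' \\ \mathbf{a}_2' \end{pmatrix}$ be a $2 \times p$ matrix, $\mathbf{x}$ be a $p \times 1$ vector, and $\mathbf{S}$ be a $p \times p$ matrix. Then

$$\mathbf{Ax} = \begin{pmatrix} \mathbf{a}_1'\mathbf{x} \\ \mathbf{a}_2'\mathbf{x} \end{pmatrix}, \tag{2.49}$$

$$\mathbf{ASA}' = \begin{pmatrix} \mathbf{a}_1'\mathbf{S}\mathbf{a}_1 & \mathbf{a}_1'\mathbf{S}\mathbf{a}_2 \\ \mathbf{a}_2'\mathbf{S}\mathbf{a}_1 & \mathbf{a}_2'\mathbf{S}\mathbf{a}_2 \end{pmatrix}. \tag{2.50}$$

Any matrix can be multiplied by its transpose. If $\mathbf{A}$ is $n \times p$, then

$\mathbf{AA}'$ is $n \times n$ and is obtained as products of rows of $\mathbf{A}$ [see (2.52)].

Similarly,

$\mathbf{A}'\mathbf{A}$ is $p \times p$ and is obtained as products of columns of $\mathbf{A}$ [see (2.54)].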


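From (2.6) and (2.27), it is clear that both $\mathbf{AA}'$ and $\mathbf{A}'\mathbf{A}$ are symmetric.

In the preceding illustration for $\mathbf{AB}$ in terms of row and column vectors, the rows of $\mathbf{A}$ were denoted by $\mathbf{a}_i'$ and the columns of $\mathbf{B}$ by $\mathbf{b}_j$. If both rows and columns of a matrix $\mathbf{A}$ are under discussion, as in $\mathbf{AA}'$ and $\mathbf{A}'\mathbf{A}$, we will use the notation $\mathbf{a}_i'$ for rows and $\mathbf{a}_{(j)}$ for columns. To illustrate, if $\mathbf{A}$ is $3 \times 4$, we have

$$\mathbf{A} = \begin{pmatrix} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & a_{34} \end{pmatrix} = \begin{pmatrix} \mathbf{a}_1' \\ \mathbf{a}_2' \\ \mathbf{a}_3' \end{pmatrix} = (\mathbf{a}_{(1)}, \mathbf{a}_{(2)}, \mathbf{a}_{(3)}, \mathbf{a}_{(4)}),$$

where, for example,

$$\mathbf{a}_2' = (a_{21} \;\; a_{22} \;\; a_{23} \;\; a_{24}), \quad \mathbf{a}_{(3)} = \begin{pmatrix} a_{13} \\ a_{23} \\ a_{33} \end{pmatrix}.$$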

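With this notation for rows and columns of $\mathbf{A}$, we can express the elements of $\mathbf{A}'\mathbf{A}$ or of $\mathbf{AA}'$ as products of the rows of $\mathbf{A}$ or of the columns of $\mathbf{A}$. Thus if we write $\mathbf{A}$ in terms of its rows as

$$\mathbf{A} = \begin{pmatrix} \mathbf{a}_1' \\ \mathbf{a}_2' \\ \vdots \\ \mathbf{a}_n' \end{pmatrix},$$

then we have

$$\mathbf{A}'\mathbf{A} = (\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_n) \begin{pmatrix} \mathbf{a}_1' \\ \mathbf{a}_2' \\ \vdots \\ \mathbf{a}_n' \end{pmatrix} = \sum_{i=1}^{n} \mathbf{a}_i \mathbf{a}_i', \tag{2.51}$$

$$\mathbf{AA}' = \begin{pmatrix} \mathbf{a}_1' \\ \mathbf{a}_2' \\ \vdots \\ \mathbf{a}_n' \end{pmatrix} (\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_n) = \begin{pmatrix} \mathbf{a}_1'\mathbf{a}_1 & \mathbf{a}_1'\mathbf{a}_2 & \cdots & \mathbf{a}_1'\mathbf{a}_n \\ \mathbf{a}_2'\mathbf{a}_1 & \mathbf{a}_2'\mathbf{a}_2 & \cdots & \mathbf{a}_2'\mathbf{a}_n \\ \vdots & \vdots & & \vdots \\ \mathbf{a}_n'\mathbf{a}_1 & \mathbf{a}_n'\mathbf{a}_2 & \cdots & \mathbf{a}_n'\mathbf{a}_n \end{pmatrix}. \tag{2.52}$$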

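Similarly, if we express $\mathbf{A}$ in terms of its columns as

$$\mathbf{A} = (\mathbf{a}_{(1)}, \mathbf{a}_{(2)}, \ldots, \mathbf{a}_{(p)}),$$

then

$$\mathbf{AA}' = (\mathbf{a}_{(1)}, \mathbf{a}_{(2)}, \ldots, \mathbf{a}_{(p)}) \begin{pmatrix} \mathbf{a}_{(1)}' \\ \mathbf{a}_{(2)}' \\ \vdots \\ \mathbf{a}_{(p)}' \end{pmatrix} = \sum_{j=1}^{p} \mathbf{a}_{(j)} \mathbf{a}_{(j)}', \tag{2.53}$$

$$\mathbf{A}'\mathbf{A} = \begin{pmatrix} \mathbf{a}_{(1)}' \\ \mathbf{a}_{(2)}' \\ \vdots \\ \mathbf{a}_{(p)}' \end{pmatrix} (\mathbf{a}_{(1)}, \mathbf{a}_{(2)}, \ldots, \mathbf{a}_{(p)}) = \begin{pmatrix} \mathbf{a}_{(1)}'\mathbf{a}_{(1)} & \mathbf{a}_{(1)}'\mathbf{a}_{(2)} & \cdots & \mathbf{a}_{(1)}'\mathbf{a}_{(p)} \\ \mathbf{a}_{(2)}'\mathbf{a}_{(1)} & \mathbf{a}_{(2)}'\mathbf{a}_{(2)} & \cdots & \mathbf{a}_{(2)}'\mathbf{a}_{(p)} \\ \vdots & \vdots & & \vdots \\ \mathbf{a}_{(p)}'\mathbf{a}_{(1)} & \mathbf{a}_{(p)}'\mathbf{a}_{(2)} & \cdots & \mathbf{a}_{(p)}'\mathbf{a}_{(p)} \end{pmatrix}. \tag{2.54}$$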


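Let $\mathbf{A} = (a_{ij})$ be an $n \times n$ matrix and $\mathbf{D}$ be a diagonal matrix, $\mathbf{D} = \operatorname{diag}(d_1, d_2, \ldots, d_n)$. Then, in the product $\mathbf{DA}$, the $i$th row of $\mathbf{A}$ is multiplied by $d_i$, and in $\mathbf{AD}$, the $j$th column of $\mathbf{A}$ is multiplied by $d_j$. For example, if $n = 3$, we have

$$\mathbf{DA} = \begin{pmatrix} d_1 & 0 & 0 \\ 0 & d_2 & 0 \\ 0 & 0 & d_3 \end{pmatrix} \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} = \begin{pmatrix} d_1 a_{11} & d_1 a_{12} & d_1 a_{13} \\ d_2 a_{21} & d_2 a_{22} & d_2 a_{23} \\ d_3 a_{31} & d_3 a_{32} & d_3 a_{33} \end{pmatrix}, \tag{2.55}$$

$$\mathbf{AD} = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} \begin{pmatrix} d_1 & 0 & 0 \\ 0 & d_2 & 0 \\ 0 & 0 & d_3 \end{pmatrix} = \begin{pmatrix} d_1 a_{11} & d_2 a_{12} & d_3 a_{13} \\ d_1 a_{21} & d_2 a_{22} & d_3 a_{23} \\ d_1 a_{31} & d_2 a_{32} & d_3 a_{33} \end{pmatrix}, \tag{2.56}$$

$$\mathbf{DAD} = \begin{pmatrix} d_1^2 a_{11} & d_1 d_2 a_{12} & d_1 d_3 a_{13} \\ d_2 d_1 a_{21} & d_2^2 a_{22} & d_2 d_3 a_{23} \\ d_3 d_1 a_{31} & d_3 d_2 a_{32} & d_3^2 a_{33} \end{pmatrix}. \tag{2.57}$$

In the special case where the diagonal matrix is the identity, we have

$$\mathbf{IA} = \mathbf{AI} = \mathbf{A}. \tag{2.58}$$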


If  A  is  rectangular,  (2.58)  still  holds,  but  the  two  identities  are  of  different  sizes. 
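The row and column scaling described in (2.55) through (2.57) can be checked numerically. The following Python sketch uses hypothetical helper names and an arbitrary 3 x 3 matrix and diagonal chosen only for illustration.

```python
def matmul(a, b):
    """Row-by-column product of conformable list-of-rows matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def diag(d):
    """Build a diagonal matrix from a list of diagonal elements."""
    return [[d[i] if i == j else 0 for j in range(len(d))]
            for i in range(len(d))]

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]   # assumed illustrative values
d = [2, 3, 5]
D = diag(d)

DA = matmul(D, A)               # row i of A multiplied by d[i]
AD = matmul(A, D)               # column j of A multiplied by d[j]
DAD = matmul(matmul(D, A), D)   # element (i, j) becomes d[i] * d[j] * a[i][j]
print(DA, AD, DAD, sep="\n")
```

Premultiplying by a diagonal matrix therefore rescales rows, and postmultiplying rescales columns, which is the pattern used later when converting a covariance matrix to a correlation matrix.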
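The product of a scalar and a matrix is obtained by multiplying each element of the matrix by the scalar:

$$c\mathbf{A} = (c a_{ij}) = \begin{pmatrix} c a_{11} & c a_{12} & \cdots & c a_{1m} \\ c a_{21} & c a_{22} & \cdots & c a_{2m} \\ \vdots & \vdots & & \vdots \\ c a_{n1} & c a_{n2} & \cdots & c a_{nm} \end{pmatrix}. \tag{2.59}$$

For example,

$$c\mathbf{I} = \begin{pmatrix} c & 0 & \cdots & 0 \\ 0 & c & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & c \end{pmatrix}, \tag{2.60}$$

$$c\mathbf{x} = \begin{pmatrix} c x_1 \\ c x_2 \\ \vdots \\ c x_n \end{pmatrix}. \tag{2.61}$$

Since $c a_{ij} = a_{ij} c$, the product of a scalar and a matrix is commutative:

$$c\mathbf{A} = \mathbf{A}c. \tag{2.62}$$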

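Multiplication of vectors or matrices by scalars permits the use of linear combinations, such as

$$\sum_{i=1}^{k} a_i \mathbf{x}_i = a_1 \mathbf{x}_1 + a_2 \mathbf{x}_2 + \cdots + a_k \mathbf{x}_k,$$

$$\sum_{i=1}^{k} a_i \mathbf{B}_i = a_1 \mathbf{B}_1 + a_2 \mathbf{B}_2 + \cdots + a_k \mathbf{B}_k.$$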

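If $\mathbf{A}$ is a symmetric matrix and $\mathbf{x}$ and $\mathbf{y}$ are vectors, the product

$$\mathbf{y}'\mathbf{A}\mathbf{y} = \sum_{i} a_{ii} y_i^2 + \sum_{i \ne j} a_{ij} y_i y_j \tag{2.63}$$

is called a quadratic form, whereas

$$\mathbf{x}'\mathbf{A}\mathbf{y} = \sum_{ij} a_{ij} x_i y_j \tag{2.64}$$

is called a bilinear form. Either of these is, of course, a scalar and can be treated as such. Expressions such as $\mathbf{x}'\mathbf{A}\mathbf{y} / \sqrt{\mathbf{x}'\mathbf{A}\mathbf{x}}$ are permissible (assuming $\mathbf{A}$ is positive definite; see Section 2.7).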
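To make (2.63) and (2.64) concrete, the following Python sketch evaluates a quadratic form and a bilinear form directly from the double sums. The function names and the vectors `x` and `y` are assumptions made for illustration; the symmetric matrix is the one used as an example in Section 2.2.3.

```python
def bilinear(x, a, y):
    """x'Ay as the double sum of a[i][j] * x[i] * y[j], as in (2.64)."""
    return sum(a[i][j] * x[i] * y[j]
               for i in range(len(x)) for j in range(len(y)))

def quadratic(y, a):
    """y'Ay, the special case of the bilinear form with x = y."""
    return bilinear(y, a, y)

A = [[3, -2, 4],
     [-2, 10, -7],
     [4, -7, 9]]          # symmetric
y = [1, 2, -1]            # assumed illustrative vectors
x = [0, 1, 1]

# Split (2.63) into its diagonal part and its off-diagonal part.
diag_part = sum(A[i][i] * y[i] ** 2 for i in range(3))
off_part = sum(A[i][j] * y[i] * y[j]
               for i in range(3) for j in range(3) if i != j)

print(quadratic(y, A), diag_part + off_part)
print(bilinear(x, A, y))
```

Both results are ordinary numbers, which is why they can be combined with the usual scalar operations such as division and square roots.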


2.4  PARTITIONED  MATRICES 

It  is  sometimes  convenient  to  partition  a  matrix  into  submatrices.  For  example,  a 
partitioning  of  a  matrix  A  into  four  submatrices  could  be  indicated  symbolically  as 
follows: 


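$$\mathbf{A} = \begin{pmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{pmatrix}.$$

For example, a $4 \times 5$ matrix $\mathbf{A}$ can be partitioned as

$$\mathbf{A} = \left( \begin{array}{ccc|cc} 2 & 1 & 3 & 8 & 4 \\ -3 & 4 & 0 & 2 & 7 \\ 9 & 3 & 6 & 5 & -2 \\ \hline 4 & 8 & 3 & 1 & 6 \end{array} \right) = \begin{pmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{pmatrix},$$

where

$$\mathbf{A}_{11} = \begin{pmatrix} 2 & 1 & 3 \\ -3 & 4 & 0 \\ 9 & 3 & 6 \end{pmatrix}, \quad \mathbf{A}_{12} = \begin{pmatrix} 8 & 4 \\ 2 & 7 \\ 5 & -2 \end{pmatrix},$$

$$\mathbf{A}_{21} = (4 \;\; 8 \;\; 3), \quad \mathbf{A}_{22} = (1 \;\; 6).$$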


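If two matrices $\mathbf{A}$ and $\mathbf{B}$ are conformable and $\mathbf{A}$ and $\mathbf{B}$ are partitioned so that the submatrices are appropriately conformable, then the product $\mathbf{AB}$ can be found by following the usual row-by-column pattern of multiplication on the submatrices as if they were single elements; for example,

$$\mathbf{AB} = \begin{pmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{pmatrix} \begin{pmatrix} \mathbf{B}_{11} & \mathbf{B}_{12} \\ \mathbf{B}_{21} & \mathbf{B}_{22} \end{pmatrix} = \begin{pmatrix} \mathbf{A}_{11}\mathbf{B}_{11} + \mathbf{A}_{12}\mathbf{B}_{21} & \mathbf{A}_{11}\mathbf{B}_{12} + \mathbf{A}_{12}\mathbf{B}_{22} \\ \mathbf{A}_{21}\mathbf{B}_{11} + \mathbf{A}_{22}\mathbf{B}_{21} & \mathbf{A}_{21}\mathbf{B}_{12} + \mathbf{A}_{22}\mathbf{B}_{22} \end{pmatrix}. \tag{2.65}$$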


It can be seen that this formulation is equivalent to the usual row-by-column definition of matrix multiplication. For example, the (1, 1) element of $\mathbf{AB}$ is the product of the first row of $\mathbf{A}$ and the first column of $\mathbf{B}$. In the (1, 1) element of $\mathbf{A}_{11}\mathbf{B}_{11}$ we have the sum of products of part of the first row of $\mathbf{A}$ and part of the first column of $\mathbf{B}$. In the (1, 1) element of $\mathbf{A}_{12}\mathbf{B}_{21}$ we have the sum of products of the rest of the first row of $\mathbf{A}$ and the remainder of the first column of $\mathbf{B}$.
Multiplication of a matrix and a vector can also be carried out in partitioned form. For example,

$$\mathbf{Ab} = (\mathbf{A}_1, \mathbf{A}_2) \begin{pmatrix} \mathbf{b}_1 \\ \mathbf{b}_2 \end{pmatrix} = \mathbf{A}_1\mathbf{b}_1 + \mathbf{A}_2\mathbf{b}_2, \tag{2.66}$$

where the partitioning of the columns of $\mathbf{A}$ corresponds to the partitioning of the elements of $\mathbf{b}$. Note that the partitioning of $\mathbf{A}$ into two sets of columns is indicated by a comma, $\mathbf{A} = (\mathbf{A}_1, \mathbf{A}_2)$.

The  partitioned  multiplication  in  (2.66)  can  be  extended  to  individual  columns  of 
A  and  individual  elements  of  b: 


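$$\mathbf{Ab} = (\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_p) \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_p \end{pmatrix} = b_1 \mathbf{a}_1 + b_2 \mathbf{a}_2 + \cdots + b_p \mathbf{a}_p. \tag{2.67}$$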


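Thus $\mathbf{Ab}$ is expressible as a linear combination of the columns of $\mathbf{A}$, the coefficients being elements of $\mathbf{b}$. For example, let

$$\mathbf{A} = \begin{pmatrix} 3 & -2 & 1 \\ 2 & 1 & 0 \\ 4 & 3 & 2 \end{pmatrix} \quad \text{and} \quad \mathbf{b} = \begin{pmatrix} 4 \\ 2 \\ 3 \end{pmatrix}.$$

Then

$$\mathbf{Ab} = \begin{pmatrix} 11 \\ 10 \\ 28 \end{pmatrix}.$$

Using a linear combination of columns of $\mathbf{A}$ as in (2.67), we obtain

$$\mathbf{Ab} = b_1 \mathbf{a}_1 + b_2 \mathbf{a}_2 + b_3 \mathbf{a}_3 = 4 \begin{pmatrix} 3 \\ 2 \\ 4 \end{pmatrix} + 2 \begin{pmatrix} -2 \\ 1 \\ 3 \end{pmatrix} + 3 \begin{pmatrix} 1 \\ 0 \\ 2 \end{pmatrix} = \begin{pmatrix} 12 \\ 8 \\ 16 \end{pmatrix} + \begin{pmatrix} -4 \\ 2 \\ 6 \end{pmatrix} + \begin{pmatrix} 3 \\ 0 \\ 6 \end{pmatrix} = \begin{pmatrix} 11 \\ 10 \\ 28 \end{pmatrix}.$$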
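The two routes to $\mathbf{Ab}$, row-by-column multiplication and the column combination in (2.67), can be compared in a short Python sketch. The function names are hypothetical; the data are the matrix and vector of the example above.

```python
def matvec(a, b):
    """Ab computed row by column: each element is a row of A times b."""
    return [sum(r * c for r, c in zip(row, b)) for row in a]

def column_combination(a, b):
    """Ab computed as b_1 a_1 + b_2 a_2 + ... + b_p a_p, as in (2.67)."""
    result = [0] * len(a)
    for coef, col in zip(b, zip(*a)):          # zip(*a) yields the columns of A
        result = [r + coef * c for r, c in zip(result, col)]
    return result

A = [[3, -2, 1],
     [2, 1, 0],
     [4, 3, 2]]
b = [4, 2, 3]

print(matvec(A, b))
print(column_combination(A, b))
```

The second form is the one that makes the link to linear dependence: if some nonzero `b` gives a zero result, the columns of `A` are linearly dependent.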
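We note that if $\mathbf{A}$ is partitioned as in (2.66), $\mathbf{A} = (\mathbf{A}_1, \mathbf{A}_2)$, the transpose is not equal to $(\mathbf{A}_1', \mathbf{A}_2')$, but rather

$$\mathbf{A}' = (\mathbf{A}_1, \mathbf{A}_2)' = \begin{pmatrix} \mathbf{A}_1' \\ \mathbf{A}_2' \end{pmatrix}. \tag{2.68}$$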


2.5  RANK 

Before defining the rank of a matrix, we first introduce the notion of linear independence and dependence. A set of vectors $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_n$ is said to be linearly dependent if constants $c_1, c_2, \ldots, c_n$ (not all zero) can be found such that

$$c_1 \mathbf{a}_1 + c_2 \mathbf{a}_2 + \cdots + c_n \mathbf{a}_n = \mathbf{0}. \tag{2.69}$$

If no constants $c_1, c_2, \ldots, c_n$ (not all zero) can be found satisfying (2.69), the set of vectors is said to be linearly independent.

If (2.69) holds, then at least one of the vectors $\mathbf{a}_i$ can be expressed as a linear combination of the other vectors in the set. Thus linear dependence of a set of vectors implies redundancy in the set. Among linearly independent vectors there is no redundancy of this type.

The  rank  of  any  square  or  rectangular  matrix  A  is  defined  as 

rank(A)  =  number  of  linearly  independent  rows  of  A 

=  number  of  linearly  independent  columns  of  A. 

It  can  be  shown  that  the  number  of  linearly  independent  rows  of  a  matrix  is  always 
equal  to  the  number  of  linearly  independent  columns. 

If  A  is  n  x  p,  the  maximum  possible  rank  of  A  is  the  smaller  of  n  and  p,  in  which 
case  A  is  said  to  be  of  full  rank  (sometimes  said  full  row  rank  or  full  column  rank). 
For  example, 


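$$\mathbf{A} = \begin{pmatrix} 1 & -2 & 3 \\ 5 & 2 & 4 \end{pmatrix}$$

has rank 2 because the two rows are linearly independent (neither row is a multiple of the other). However, even though $\mathbf{A}$ is full rank, the columns are linearly dependent because rank 2 implies there are only two linearly independent columns. Thus, by (2.69), there exist constants $c_1$, $c_2$, and $c_3$ such that

$$c_1 \begin{pmatrix} 1 \\ 5 \end{pmatrix} + c_2 \begin{pmatrix} -2 \\ 2 \end{pmatrix} + c_3 \begin{pmatrix} 3 \\ 4 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}. \tag{2.70}$$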


By  (2.67),  we  can  write  (2.70)  in  the  form 


Ac  =  0. 


(2.71) 


A solution vector to (2.70) or (2.71) is given by any multiple of $\mathbf{c} = (14, -11, -12)'$. Hence we have the interesting result that a product of a matrix $\mathbf{A}$ and a vector $\mathbf{c}$ is equal to $\mathbf{0}$, even though $\mathbf{A} \ne \mathbf{O}$ and $\mathbf{c} \ne \mathbf{0}$. This is a direct consequence of the linear dependence of the column vectors of $\mathbf{A}$.
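The rank of a small matrix can be computed by row reduction, and the solution vector quoted above can be checked directly. The sketch below is one possible approach, not a method given in the text: `rank` is a hypothetical helper that uses exact fractions to avoid rounding, and the first column $(1, 5)'$ of the matrix is the one consistent with $\mathbf{c} = (14, -11, -12)'$.

```python
from fractions import Fraction

def rank(m):
    """Rank by Gaussian elimination in exact rational arithmetic."""
    rows = [[Fraction(v) for v in row] for row in m]
    r = 0                                    # number of pivots found so far
    for c in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue                         # no pivot in this column
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            f = rows[i][c] / rows[r][c]
            rows[i] = [a - f * b for a, b in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r

A = [[1, -2, 3],
     [5, 2, 4]]
c = [14, -11, -12]

print(rank(A))                                          # rank of the 2 x 3 matrix
print([sum(a * x for a, x in zip(row, c)) for row in A])  # the product Ac
```

Applying `rank` to the transpose gives the same value, in line with the statement that the number of linearly independent rows equals the number of linearly independent columns.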

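Another consequence of the linear dependence of rows or columns of a matrix is the possibility of expressions such as $\mathbf{AB} = \mathbf{CB}$, where $\mathbf{A} \ne \mathbf{C}$. For example, let

$$\mathbf{A} = \begin{pmatrix} 1 & 3 & 2 \\ 2 & 0 & -1 \end{pmatrix}, \quad \mathbf{B} = \begin{pmatrix} 1 & 2 \\ 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad \mathbf{C} = \begin{pmatrix} 2 & 1 & 1 \\ 5 & -6 & -4 \end{pmatrix}.$$

Then

$$\mathbf{AB} = \mathbf{CB} = \begin{pmatrix} 3 & 5 \\ 1 & 4 \end{pmatrix}.$$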


All three matrices $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$ are full rank; but being rectangular, they have a rank deficiency in either rows or columns, which permits us to construct $\mathbf{AB} = \mathbf{CB}$ with $\mathbf{A} \ne \mathbf{C}$. Thus in a matrix equation, we cannot, in general, cancel matrices from both sides of the equation.

There are two exceptions to this rule. One exception involves a nonsingular matrix to be defined in Section 2.6. The other special case occurs when the expression holds for all possible values of the matrix common to both sides of the equation. For example,


If  Ax  =  Bx  for  all  possible  values  of  x,  then  A  =  B.  (2.72) 

To  see  this,  let  x  =  (1,  0,  . . .  ,0)'.  Then  the  first  column  of  A  equals  the  first  column 
of  B.  Now  let  x  =  (0,  1 ,  0, . . .  ,  0)',  and  the  second  column  of  A  equals  the  second 
column  of  B.  Continuing  in  this  fashion,  we  obtain  A  =  B. 

Suppose  a  rectangular  matrix  A  is  n  x  p  of  rank  p,  where  p  <  n.  We  typically 
shorten  this  statement  to  “A  is  n  x  p  of  rank  p  <  n.” 

2.6  INVERSE 

If  a  matrix  A  is  square  and  of  full  rank,  then  A  is  said  to  be  nonsingular,  and  A  has 
a  unique  inverse,  denoted  by  A-1,  with  the  property  that 

$$\mathbf{AA}^{-1} = \mathbf{A}^{-1}\mathbf{A} = \mathbf{I}. \tag{2.73}$$
For example, let


If  A  is  square  and  of  less  than  full  rank,  then  an  inverse  does  not  exist,  and  A  is 
said  to  be  singular.  Note  that  rectangular  matrices  do  not  have  inverses  as  in  (2.73), 
even  if  they  are  full  rank. 

If  A  and  B  are  the  same  size  and  nonsingular,  then  the  inverse  of  their  product  is 
the  product  of  their  inverses  in  reverse  order, 

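$$(\mathbf{AB})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}. \tag{2.74}$$

Note that (2.74) holds only for nonsingular matrices. Thus, for example, if $\mathbf{A}$ is $n \times p$ of rank $p < n$, then $\mathbf{A}'\mathbf{A}$ has an inverse, but $(\mathbf{A}'\mathbf{A})^{-1}$ is not equal to $\mathbf{A}^{-1}(\mathbf{A}')^{-1}$ because $\mathbf{A}$ is rectangular and does not have an inverse.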

If  a  matrix  is  nonsingular,  it  can  be  canceled  from  both  sides  of  an  equation,  pro¬ 
vided  it  appears  on  the  left  (or  right)  on  both  sides.  For  example,  if  B  is  nonsingular, 
then 


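$$\mathbf{AB} = \mathbf{CB} \quad \text{implies} \quad \mathbf{A} = \mathbf{C},$$

since we can multiply on the right by $\mathbf{B}^{-1}$ to obtain

$$\mathbf{ABB}^{-1} = \mathbf{CBB}^{-1},$$

$$\mathbf{AI} = \mathbf{CI},$$

$$\mathbf{A} = \mathbf{C}.$$

Otherwise, if $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$ are rectangular or square and singular, it is easy to construct $\mathbf{AB} = \mathbf{CB}$, with $\mathbf{A} \ne \mathbf{C}$, as illustrated near the end of Section 2.5.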

The  inverse  of  the  transpose  of  a  nonsingular  matrix  is  given  by  the  transpose  of 
the  inverse: 


$$(\mathbf{A}')^{-1} = (\mathbf{A}^{-1})'. \tag{2.75}$$

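If the symmetric nonsingular matrix $\mathbf{A}$ is partitioned in the form

$$\mathbf{A} = \begin{pmatrix} \mathbf{A}_{11} & \mathbf{a}_{12} \\ \mathbf{a}_{12}' & a_{22} \end{pmatrix},$$

then the inverse is given by

$$\mathbf{A}^{-1} = \frac{1}{b} \begin{pmatrix} b\mathbf{A}_{11}^{-1} + \mathbf{A}_{11}^{-1}\mathbf{a}_{12}\mathbf{a}_{12}'\mathbf{A}_{11}^{-1} & -\mathbf{A}_{11}^{-1}\mathbf{a}_{12} \\ -\mathbf{a}_{12}'\mathbf{A}_{11}^{-1} & 1 \end{pmatrix}, \tag{2.76}$$

where $b = a_{22} - \mathbf{a}_{12}'\mathbf{A}_{11}^{-1}\mathbf{a}_{12}$. A nonsingular matrix of the form $\mathbf{B} + \mathbf{cc}'$, where $\mathbf{B}$ is nonsingular, has as its inverse

$$(\mathbf{B} + \mathbf{cc}')^{-1} = \mathbf{B}^{-1} - \frac{\mathbf{B}^{-1}\mathbf{cc}'\mathbf{B}^{-1}}{1 + \mathbf{c}'\mathbf{B}^{-1}\mathbf{c}}. \tag{2.77}$$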
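Several identities of this section, namely (2.73), (2.74), (2.75), and (2.77), can be verified numerically. The following Python sketch is one way to do so under stated assumptions: `inverse` is a hypothetical Gauss-Jordan helper working in exact fractions, and the matrices `A`, `B` and the vector `c` are values chosen here for illustration only.

```python
from fractions import Fraction

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def identity(n):
    return [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]

def inverse(m):
    """Gauss-Jordan inverse in exact arithmetic; raises if m is singular."""
    n = len(m)
    aug = [[Fraction(v) for v in row] + eye
           for row, eye in zip(m, identity(n))]       # augmented matrix (m | I)
    for c in range(n):
        pivot = next((r for r in range(c, n) if aug[r][c] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular")
        aug[c], aug[pivot] = aug[pivot], aug[c]
        pv = aug[c][c]
        aug[c] = [v / pv for v in aug[c]]             # scale pivot row to 1
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def rank_one_update_inverse(b, c):
    """Right side of (2.77): B^{-1} - B^{-1}cc'B^{-1} / (1 + c'B^{-1}c)."""
    b_inv, ct = inverse(b), transpose(c)
    numer = matmul(matmul(matmul(b_inv, c), ct), b_inv)
    denom = 1 + matmul(matmul(ct, b_inv), c)[0][0]
    return [[b_inv[i][j] - numer[i][j] / denom for j in range(len(b))]
            for i in range(len(b))]

A = [[3, 4], [2, 6]]        # assumed illustrative nonsingular matrices
B = [[1, -2], [3, 5]]
c = [[1], [2]]              # assumed illustrative 2 x 1 vector
B_plus = [[B[i][j] + c[i][0] * c[j][0] for j in range(2)] for i in range(2)]

print(matmul(A, inverse(A)) == identity(2))                        # (2.73)
print(inverse(matmul(A, B)) == matmul(inverse(B), inverse(A)))     # (2.74)
print(inverse(transpose(A)) == transpose(inverse(A)))              # (2.75)
print(rank_one_update_inverse(B, c) == inverse(B_plus))            # (2.77)
```

Exact fractions are used so that the equality checks are not disturbed by floating-point rounding; a numerical library would normally be used for matrices of realistic size.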


2.7  POSITIVE  DEFINITE  MATRICES 

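The symmetric matrix $\mathbf{A}$ is said to be positive definite if $\mathbf{x}'\mathbf{A}\mathbf{x} > 0$ for all possible vectors $\mathbf{x}$ (except $\mathbf{x} = \mathbf{0}$). Similarly, $\mathbf{A}$ is positive semidefinite if $\mathbf{x}'\mathbf{A}\mathbf{x} \ge 0$ for all $\mathbf{x} \ne \mathbf{0}$. [A quadratic form $\mathbf{x}'\mathbf{A}\mathbf{x}$ was defined in (2.63).] The diagonal elements $a_{ii}$ of a positive definite matrix are positive. To see this, let $\mathbf{x}' = (0, \ldots, 0, 1, 0, \ldots, 0)$ with a 1 in the $i$th position. Then $\mathbf{x}'\mathbf{A}\mathbf{x} = a_{ii} > 0$. Similarly, for a positive semidefinite matrix $\mathbf{A}$, $a_{ii} \ge 0$ for all $i$.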

One  way  to  obtain  a  positive  definite  matrix  is  as  follows: 

If  A  =  B'B,  where  B  is  n  x  p  of  rank  p  <  n,  then  B'B  is  positive  definite.  (2.78) 
This  is  easily  shown: 


x'Ax  =  x'B'Bx  =  (Bx)'(Bx)  =  z'z, 

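where $\mathbf{z} = \mathbf{Bx}$. Thus $\mathbf{x}'\mathbf{A}\mathbf{x} = \sum_{i=1}^{n} z_i^2$, which is positive ($\mathbf{Bx}$ cannot be $\mathbf{0}$ unless $\mathbf{x} = \mathbf{0}$, because $\mathbf{B}$ is full rank). If $\mathbf{B}$ is less than full rank, then by a similar argument, $\mathbf{B}'\mathbf{B}$ is positive semidefinite.

Note that $\mathbf{A} = \mathbf{B}'\mathbf{B}$ is analogous to $a = b^2$ in real numbers, where the square of any nonzero number (including negative numbers) is positive.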

In  another  analogy  to  positive  real  numbers,  a  positive  definite  matrix  can  be 
factored  into  a  “square  root”  in  two  ways.  We  give  one  method  in  (2.79)  and  the 
other  in  Section  2.11.8. 

A  positive  definite  matrix  A  can  be  factored  into 

A  =  T'T,  (2.79) 

where  T  is  a  nonsingular  upper  triangular  matrix.  One  way  to  obtain  T  is  the 
Cholesky  decomposition,  which  can  be  carried  out  in  the  following  steps. 

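Let $\mathbf{A} = (a_{ij})$ and $\mathbf{T} = (t_{ij})$ be n × n. Then the elements of T are found as
follows:

$$t_{11} = \sqrt{a_{11}}, \qquad t_{1j} = \frac{a_{1j}}{t_{11}}, \qquad 2 \le j \le n,$$

$$t_{ii} = \sqrt{a_{ii} - \sum_{k=1}^{i-1} t_{ki}^2}, \qquad 2 \le i \le n,$$

$$t_{ij} = \frac{a_{ij} - \sum_{k=1}^{i-1} t_{ki}t_{kj}}{t_{ii}}, \qquad 2 \le i < j \le n,$$

$$t_{ij} = 0, \qquad 1 \le j < i \le n.$$

For example, let

$$\mathbf{A} = \begin{pmatrix} 3 & 0 & -3 \\ 0 & 6 & 3 \\ -3 & 3 & 6 \end{pmatrix}.$$

Then by the Cholesky method, we obtain

$$\mathbf{T} = \begin{pmatrix} \sqrt{3} & 0 & -\sqrt{3} \\ 0 & \sqrt{6} & \sqrt{1.5} \\ 0 & 0 & \sqrt{1.5} \end{pmatrix},$$

$$\mathbf{T}'\mathbf{T} = \begin{pmatrix} \sqrt{3} & 0 & 0 \\ 0 & \sqrt{6} & 0 \\ -\sqrt{3} & \sqrt{1.5} & \sqrt{1.5} \end{pmatrix} \begin{pmatrix} \sqrt{3} & 0 & -\sqrt{3} \\ 0 & \sqrt{6} & \sqrt{1.5} \\ 0 & 0 & \sqrt{1.5} \end{pmatrix} = \begin{pmatrix} 3 & 0 & -3 \\ 0 & 6 & 3 \\ -3 & 3 & 6 \end{pmatrix} = \mathbf{A}.$$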
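The Cholesky steps above can be sketched in a few lines of code. This is an illustrative implementation of the stated recursions, not a substitute for a library routine; the function name `cholesky_upper` is an assumption. It is applied to the 3 x 3 matrix A of the example, whose entries appear in the product T'T = A.

```python
import math


def cholesky_upper(a):
    """Return upper triangular T with A = T'T for a positive definite A."""
    n = len(a)
    t = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Diagonal element: t_ii = sqrt(a_ii - sum_{k<i} t_ki^2)
        pivot = a[i][i] - sum(t[k][i] ** 2 for k in range(i))
        if pivot <= 0:
            raise ValueError("matrix is not positive definite")
        t[i][i] = math.sqrt(pivot)
        # Elements to the right of the diagonal:
        # t_ij = (a_ij - sum_{k<i} t_ki t_kj) / t_ii
        for j in range(i + 1, n):
            t[i][j] = (a[i][j] - sum(t[k][i] * t[k][j] for k in range(i))) / t[i][i]
    return t


A = [[3, 0, -3],
     [0, 6, 3],
     [-3, 3, 6]]
T = cholesky_upper(A)
n = len(A)

# Rebuild A as T'T to confirm the factorization.
TtT = [[sum(T[k][i] * T[k][j] for k in range(n)) for j in range(n)]
       for i in range(n)]

for row in T:
    print([round(v, 4) for v in row])
```

Elements below the diagonal are left at zero, matching the last line of the recursions.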


2.8  DETERMINANTS 

The  determinant  of  an  n  x  n  matrix  A  is  defined  as  the  sum  of  all  n !  possible  products 
of  n  elements  such  that 

1.  each  product  contains  one  element  from  every  row  and  every  column,  and 

2.  the  factors  in  each  product  are  written  so  that  the  column  subscripts  appear  in 
order  of  magnitude  and  each  product  is  then  preceded  by  a  plus  or  minus  sign 
according  to  whether  the  number  of  inversions  in  the  row  subscripts  is  even  or 
odd. 

An inversion occurs whenever a larger number precedes a smaller one. The symbol
n! is defined as

$$n! = n(n-1)(n-2)\cdots 2 \cdot 1. \tag{2.80}$$

The determinant of A is a scalar denoted by |A| or by det(A). The preceding
definition is not useful in evaluating determinants, except in the case of 2 × 2 or 3 × 3
matrices. For larger matrices, other methods are available for manual computation,
but determinants are typically evaluated by computer. For a 2 × 2 matrix, the
determinant is found by

$$|\mathbf{A}| = \begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix} = a_{11}a_{22} - a_{21}a_{12}. \tag{2.81}$$

For a 3 × 3 matrix, the determinant is given by

$$|\mathbf{A}| = a_{11}a_{22}a_{33} + a_{12}a_{23}a_{31} + a_{13}a_{32}a_{21} - a_{31}a_{22}a_{13} - a_{32}a_{23}a_{11} - a_{33}a_{12}a_{21}. \tag{2.82}$$


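This can be found by the following scheme. The three positive terms are obtained
from the products along the main diagonal ($a_{11}a_{22}a_{33}$) and along the two broken
diagonals parallel to it, and the three negative terms from the products along the
opposite diagonal ($a_{31}a_{22}a_{13}$) and the two broken diagonals parallel to it.
[Diagrams of the two diagonal schemes not reproduced.]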
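The definition of the determinant and the 3 × 3 formula (2.82) can be compared directly. The sketch below, with the illustrative function names `det_by_definition` and `det_3x3`, sums over all n! products, keeping column subscripts in order and taking the sign from the number of inversions in the row subscripts. The matrix is the one from the Cholesky example in Section 2.7.

```python
from itertools import permutations


def count_inversions(seq):
    """Number of pairs in which a larger number precedes a smaller one."""
    return sum(1 for i in range(len(seq)) for j in range(i + 1, len(seq))
               if seq[i] > seq[j])


def det_by_definition(a):
    """Determinant as a signed sum of n! products (one element per row and column)."""
    n = len(a)
    total = 0
    for rows in permutations(range(n)):
        # Columns run 0, 1, ..., n-1 in order; rows[col] is the row subscript.
        sign = -1 if count_inversions(rows) % 2 else 1
        product = 1
        for col in range(n):
            product *= a[rows[col]][col]
        total += sign * product
    return total


def det_3x3(a):
    """Determinant of a 3 x 3 matrix using the six terms of (2.82)."""
    return (a[0][0] * a[1][1] * a[2][2]
            + a[0][1] * a[1][2] * a[2][0]
            + a[0][2] * a[2][1] * a[1][0]
            - a[2][0] * a[1][1] * a[0][2]
            - a[2][1] * a[1][2] * a[0][0]
            - a[2][2] * a[0][1] * a[1][0])


A = [[3, 0, -3],
     [0, 6, 3],
     [-3, 3, 6]]
print(det_by_definition(A), det_3x3(A))  # both give 27
```

The permutation sum grows as n!, which is why the text notes that the definition is not practical beyond small matrices.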

The determinant of a diagonal matrix is the product of the diagonal elements; that
is, if $\mathbf{D} = \mathrm{diag}(d_1, d_2, \ldots, d_n)$, then

$$|\mathbf{D}| = \prod_{i=1}^{n} d_i. \tag{2.83}$$

As a special case of (2.83), suppose all diagonal elements are equal, say,

$$\mathbf{D} = \mathrm{diag}(c, c, \ldots, c) = c\mathbf{I}.$$

Then

$$|\mathbf{D}| = |c\mathbf{I}| = \prod_{i=1}^{n} c = c^n. \tag{2.84}$$

The extension of (2.84) to any square matrix A is

$$|c\mathbf{A}| = c^n|\mathbf{A}|. \tag{2.85}$$

Since the determinant is a scalar, we can carry out operations such as

$$|\mathbf{A}|^2, \qquad |\mathbf{A}|^{1/2}, \qquad \frac{1}{|\mathbf{A}|},$$

provided that |A| > 0 for $|\mathbf{A}|^{1/2}$ and that |A| ≠ 0 for 1/|A|.

If  the  square  matrix  A  is  singular,  its  determinant  is  0: 

|A|  =  0  if  A  is  singular.  (2.86) 

If A is near singular, then there exists a linear combination of the columns that is
close to 0, and |A| is also close to 0. If A is nonsingular, its determinant is nonzero:

|A| ≠ 0 if A is nonsingular.  (2.87)

If  A  is  positive  definite,  its  determinant  is  positive: 

|A|  >  0  if  A  is  positive  definite.  (2.88) 

If  A  and  B  are  square  and  the  same  size,  the  determinant  of  the  product  is  the 
product  of  the  determinants: 

|AB| = |A||B|.  (2.89)

For  example,  let 

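$$\mathbf{A} = \begin{pmatrix} 1 & 2 \\ -3 & 5 \end{pmatrix} \quad \text{and} \quad \mathbf{B} = \begin{pmatrix} 4 & 2 \\ 1 & 3 \end{pmatrix}.$$

Then

$$\mathbf{A}\mathbf{B} = \begin{pmatrix} 6 & 8 \\ -7 & 9 \end{pmatrix}, \qquad |\mathbf{A}\mathbf{B}| = 110,$$

$$|\mathbf{A}| = 11, \qquad |\mathbf{B}| = 10, \qquad |\mathbf{A}||\mathbf{B}| = 110.$$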


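The determinant of the transpose of a matrix is the same as the determinant of the
matrix, and the determinant of the inverse of a matrix is the reciprocal of the
determinant:

$$|\mathbf{A}'| = |\mathbf{A}|, \tag{2.90}$$

$$|\mathbf{A}^{-1}| = \frac{1}{|\mathbf{A}|} = |\mathbf{A}|^{-1}. \tag{2.91}$$

If a partitioned matrix has the form

$$\mathbf{A} = \begin{pmatrix} \mathbf{A}_{11} & \mathbf{O} \\ \mathbf{O} & \mathbf{A}_{22} \end{pmatrix},$$

where $\mathbf{A}_{11}$ and $\mathbf{A}_{22}$ are square but not necessarily the same size, then

$$|\mathbf{A}| = \begin{vmatrix} \mathbf{A}_{11} & \mathbf{O} \\ \mathbf{O} & \mathbf{A}_{22} \end{vmatrix} = |\mathbf{A}_{11}||\mathbf{A}_{22}|. \tag{2.92}$$

For a general partitioned matrix,

$$\mathbf{A} = \begin{pmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{pmatrix},$$

where $\mathbf{A}_{11}$ and $\mathbf{A}_{22}$ are square and nonsingular (not necessarily the same size), the
determinant is given by either of the following two expressions:

$$\begin{vmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{vmatrix} = |\mathbf{A}_{11}||\mathbf{A}_{22} - \mathbf{A}_{21}\mathbf{A}_{11}^{-1}\mathbf{A}_{12}| \tag{2.93}$$

$$= |\mathbf{A}_{22}||\mathbf{A}_{11} - \mathbf{A}_{12}\mathbf{A}_{22}^{-1}\mathbf{A}_{21}|. \tag{2.94}$$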

Note the analogy of (2.93) and (2.94) to the case of the determinant of a 2 × 2
matrix as given by (2.81):

$$\begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix} = a_{11}a_{22} - a_{21}a_{12} = a_{11}\left(a_{22} - \frac{a_{21}a_{12}}{a_{11}}\right) = a_{22}\left(a_{11} - \frac{a_{12}a_{21}}{a_{22}}\right).$$

If B is nonsingular and c is a vector, then

$$|\mathbf{B} + \mathbf{c}\mathbf{c}'| = |\mathbf{B}|(1 + \mathbf{c}'\mathbf{B}^{-1}\mathbf{c}). \tag{2.95}$$


2.9  TRACE


A simple function of an n × n matrix A is the trace, denoted by tr(A) and defined
as the sum of the diagonal elements of A; that is, $\mathrm{tr}(\mathbf{A}) = \sum_{i=1}^{n} a_{ii}$. The trace is, of
course, a scalar. For example, suppose A is a 3 × 3 matrix with diagonal elements
5, −3, and 9. Then

tr(A) = 5 + (−3) + 9 = 11.


The  trace  of  the  sum  of  two  square  matrices  is  the  sum  of  the  traces  of  the  two 
matrices: 


tr(A  +  B)  =  tr(A)  +  tr(B).  (2.96) 

An  important  result  for  the  product  of  two  matrices  is 

tr(AB)  =  tr(BA).  (2.97) 


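This result holds for any matrices A and B where AB and BA are both defined. It is
not necessary that A and B be square or that AB equal BA. For example, let

$$\mathbf{A} = \begin{pmatrix} 1 & 3 \\ 2 & -1 \\ 4 & 6 \end{pmatrix}, \qquad \mathbf{B} = \begin{pmatrix} 3 & -2 & 1 \\ 2 & 4 & 5 \end{pmatrix}.$$

Then

$$\mathbf{A}\mathbf{B} = \begin{pmatrix} 9 & 10 & 16 \\ 4 & -8 & -3 \\ 24 & 16 & 34 \end{pmatrix}, \qquad \mathrm{tr}(\mathbf{A}\mathbf{B}) = 9 - 8 + 34 = 35,$$

$$\mathbf{B}\mathbf{A} = \begin{pmatrix} 3 & 17 \\ 30 & 32 \end{pmatrix}, \qquad \mathrm{tr}(\mathbf{B}\mathbf{A}) = 3 + 32 = 35.$$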
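Result (2.97) for nonsquare matrices can be checked in a few lines. This is a sketch only: the 3 × 2 matrix A and 2 × 3 matrix B below are an assumption, chosen so that AB reproduces the 3 × 3 product shown in the example and tr(BA) = 3 + 32. The helper names `matmul` and `trace` are illustrative.

```python
def matmul(a, b):
    """Product of two conformable matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def trace(m):
    """Sum of the diagonal elements of a square matrix."""
    return sum(m[i][i] for i in range(len(m)))


# Assumed matrices, consistent with the product AB given in the example.
A = [[1, 3],
     [2, -1],
     [4, 6]]
B = [[3, -2, 1],
     [2, 4, 5]]

AB = matmul(A, B)  # 3 x 3
BA = matmul(B, A)  # 2 x 2

print(trace(AB), trace(BA))  # 35 35
```

AB and BA have different sizes here, yet the two traces agree, since both equal the sum of $a_{ij}b_{ji}$ over all i and j.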


From (2.52) and (2.54), we obtain

$$\mathrm{tr}(\mathbf{A}'\mathbf{A}) = \mathrm{tr}(\mathbf{A}\mathbf{A}') = \sum_{i=1}^{n}\sum_{j=1}^{p} a_{ij}^2, \tag{2.98}$$

where the $a_{ij}$'s are elements of the n × p matrix A.

2.10  ORTHOGONAL  VECTORS  AND  MATRICES

Two vectors a and b of the same size are said to be orthogonal if

$$\mathbf{a}'\mathbf{b} = a_1b_1 + a_2b_2 + \cdots + a_nb_n = 0. \tag{2.99}$$

Geometrically, orthogonal vectors are perpendicular [see (3.14) and the comments
following (3.14)]. If a'a = 1, the vector a is said to be normalized. The vector a can
always be normalized by dividing by its length, $\sqrt{\mathbf{a}'\mathbf{a}}$. Thus

$$\mathbf{c} = \frac{\mathbf{a}}{\sqrt{\mathbf{a}'\mathbf{a}}} \tag{2.100}$$

is normalized so that c'c = 1.

A matrix $\mathbf{C} = (\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_p)$ whose columns are normalized and mutually
orthogonal is called an orthogonal matrix. Since the elements of C'C are products of
columns of C [see (2.54)], which have the properties $\mathbf{c}_i'\mathbf{c}_i = 1$ for all i and $\mathbf{c}_i'\mathbf{c}_j = 0$
for all i ≠ j, we have


C'C  =  I.  (2.101) 

If  C  satisfies  (2.101),  it  necessarily  follows  that 

CC'  =  I,  (2.102) 

from which we see that the rows of C are also normalized and mutually orthogonal.
It is clear from (2.101) and (2.102) that $\mathbf{C}^{-1} = \mathbf{C}'$ for an orthogonal matrix C.

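We illustrate the creation of an orthogonal matrix by starting with

$$\mathbf{A} = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & -1 \\ 1 & -2 & 0 \end{pmatrix},$$

whose columns are mutually orthogonal. To normalize the three columns, we divide
by the respective lengths, $\sqrt{3}$, $\sqrt{6}$, and $\sqrt{2}$, to obtain

$$\mathbf{C} = \begin{pmatrix} 1/\sqrt{3} & 1/\sqrt{6} & 1/\sqrt{2} \\ 1/\sqrt{3} & 1/\sqrt{6} & -1/\sqrt{2} \\ 1/\sqrt{3} & -2/\sqrt{6} & 0 \end{pmatrix}.$$

Note that the rows also became normalized and mutually orthogonal so that C
satisfies both (2.101) and (2.102).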

Multiplication by an orthogonal matrix has the effect of rotating axes; that is, if a
point x is transformed to z = Cx, where C is orthogonal, then

$$\mathbf{z}'\mathbf{z} = (\mathbf{C}\mathbf{x})'(\mathbf{C}\mathbf{x}) = \mathbf{x}'\mathbf{C}'\mathbf{C}\mathbf{x} = \mathbf{x}'\mathbf{I}\mathbf{x} = \mathbf{x}'\mathbf{x}, \tag{2.103}$$

and the distance from the origin to z is the same as the distance to x.
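The construction of an orthogonal matrix and the length-preserving property (2.103) can be sketched as follows. The starting matrix is taken to be the one implied by C and the stated column lengths $\sqrt{3}$, $\sqrt{6}$, $\sqrt{2}$; the point x is hypothetical, and the helper names are illustrative.

```python
import math


def transpose(m):
    """Transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Product of two conformable matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


# Matrix with mutually orthogonal columns (inferred from C in the text).
A = [[1, 1, 1],
     [1, 1, -1],
     [1, -2, 0]]
n = len(A)

# Normalize each column by dividing by its length sqrt(a'a), as in (2.100).
lengths = [math.sqrt(sum(A[i][j] ** 2 for i in range(n))) for j in range(n)]
C = [[A[i][j] / lengths[j] for j in range(n)] for i in range(n)]

CtC = matmul(transpose(C), C)  # close to I, as in (2.101)
CCt = matmul(C, transpose(C))  # close to I, as in (2.102)

x = [2.0, -1.0, 3.0]  # hypothetical point
z = [sum(C[i][j] * x[j] for j in range(n)) for i in range(n)]  # z = Cx

# z'z and x'x agree up to rounding error, as in (2.103).
print(sum(zi * zi for zi in z), sum(xi * xi for xi in x))
```

Because the arithmetic is floating point, the products C'C and CC' match the identity only up to rounding error.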


2.11  EIGENVALUES  AND  EIGENVECTORS 
2.11.1  Definition 

For every square matrix A, a scalar λ and a nonzero vector x can be found such that

$$\mathbf{A}\mathbf{x} = \lambda\mathbf{x}. \tag{2.104}$$

In (2.104), λ is called an eigenvalue of A, and x is an eigenvector of A corresponding
to λ. To find λ and x, we write (2.104) as

$$(\mathbf{A} - \lambda\mathbf{I})\mathbf{x} = \mathbf{0}. \tag{2.105}$$

If |A − λI| ≠ 0, then (A − λI) has an inverse and x = 0 is the only solution. Hence,
in order to obtain nontrivial solutions, we set |A − λI| = 0 to find values of λ
that can be substituted into (2.105) to find corresponding values of x. Alternatively,
(2.69) and (2.71) require that the columns of A − λI be linearly dependent. Thus in
(A − λI)x = 0, the matrix A − λI must be singular in order to find a solution vector
x that is not 0.

The equation |A − λI| = 0 is called the characteristic equation. If A is n × n,
the characteristic equation will have n roots; that is, A will have n eigenvalues $\lambda_1$,
$\lambda_2, \ldots, \lambda_n$. The λ's will not necessarily all be distinct or all nonzero. However, if A
arises from computations on real (continuous) data and is nonsingular, the λ's will
all be distinct (with probability 1). After finding $\lambda_1, \lambda_2, \ldots, \lambda_n$, the accompanying
eigenvectors $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$ can be found using (2.105).

If we multiply both sides of (2.105) by a scalar k and note by (2.62) that k and
A − λI commute, we obtain

$$(\mathbf{A} - \lambda\mathbf{I})k\mathbf{x} = k\mathbf{0} = \mathbf{0}. \tag{2.106}$$

Thus if x is an eigenvector of A, kx is also an eigenvector, and eigenvectors are
unique only up to multiplication by a scalar. Hence we can adjust the length of x,
but the direction from the origin is unique; that is, the relative values of (ratios of)
the components of $\mathbf{x} = (x_1, x_2, \ldots, x_n)'$ are unique. Typically, the eigenvector x is
scaled so that x'x = 1.

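To illustrate, we will find the eigenvalues and eigenvectors for the matrix

$$\mathbf{A} = \begin{pmatrix} 1 & 2 \\ -1 & 4 \end{pmatrix}.$$

The characteristic equation is

$$|\mathbf{A} - \lambda\mathbf{I}| = \begin{vmatrix} 1 - \lambda & 2 \\ -1 & 4 - \lambda \end{vmatrix} = (1 - \lambda)(4 - \lambda) + 2 = 0,$$

$$\lambda^2 - 5\lambda + 6 = (\lambda - 3)(\lambda - 2) = 0,$$

from which $\lambda_1 = 3$ and $\lambda_2 = 2$. To find the eigenvector corresponding to $\lambda_1 = 3$, we
use (2.105),

$$(\mathbf{A} - \lambda\mathbf{I})\mathbf{x} = \mathbf{0},$$

$$\begin{pmatrix} 1 - 3 & 2 \\ -1 & 4 - 3 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix},$$

$$-2x_1 + 2x_2 = 0,$$
$$-x_1 + x_2 = 0.$$

As expected, either equation is redundant in the presence of the other, and there
remains a single equation with two unknowns, $x_1 = x_2$. The solution vector can be
written with an arbitrary constant,

$$\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = x_1 \begin{pmatrix} 1 \\ 1 \end{pmatrix} = c \begin{pmatrix} 1 \\ 1 \end{pmatrix}.$$

If c is set equal to $1/\sqrt{2}$ to normalize the eigenvector, we obtain

$$\mathbf{x}_1 = \begin{pmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix}.$$

Similarly, corresponding to $\lambda_2 = 2$, we have

$$\mathbf{x}_2 = \begin{pmatrix} 2/\sqrt{5} \\ 1/\sqrt{5} \end{pmatrix}.$$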
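The hand calculation for a 2 × 2 matrix can be mirrored in code. The sketch below solves the characteristic equation as a quadratic and reads an eigenvector off the first row of (A − λI)x = 0. It assumes real roots and a nonzero upper-right element, and the function name `eigen_2x2` is illustrative, not a library routine.

```python
import math


def eigen_2x2(a):
    """Eigenvalues and normalized eigenvectors of a 2 x 2 matrix.

    Assumes the characteristic equation has real roots and a[0][1] != 0.
    """
    (a11, a12), (a21, a22) = a
    tr = a11 + a22
    det = a11 * a22 - a21 * a12
    # Characteristic equation: lambda^2 - tr * lambda + det = 0
    disc = tr * tr - 4 * det
    if disc < 0 or a12 == 0:
        raise ValueError("sketch handles real roots with a12 != 0 only")
    root = math.sqrt(disc)
    pairs = []
    for lam in ((tr + root) / 2, (tr - root) / 2):
        # First row of (A - lam I)x = 0: (a11 - lam) x1 + a12 x2 = 0
        v = (a12, lam - a11)
        norm = math.hypot(v[0], v[1])
        pairs.append((lam, (v[0] / norm, v[1] / norm)))  # scaled so x'x = 1
    return pairs


A = [[1, 2],
     [-1, 4]]
for lam, x in eigen_2x2(A):
    print(lam, x)
```

For this matrix the eigenvalues are 3 and 2, with eigenvectors proportional to (1, 1)' and (2, 1)', in agreement with the worked example.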


2.11.2  I  +  A  and  I  -  A 

If λ is an eigenvalue of A and x is the corresponding eigenvector, then 1 + λ is an
eigenvalue of I + A and 1 − λ is an eigenvalue of I − A. In either case, x is the
corresponding eigenvector.

We demonstrate this for I + A:

$$\mathbf{A}\mathbf{x} = \lambda\mathbf{x},$$
$$\mathbf{x} + \mathbf{A}\mathbf{x} = \mathbf{x} + \lambda\mathbf{x},$$
$$(\mathbf{I} + \mathbf{A})\mathbf{x} = (1 + \lambda)\mathbf{x}.$$

2.11.3  tr(A)  and  |A|

For any square matrix A with eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$, we have

$$\mathrm{tr}(\mathbf{A}) = \sum_{i=1}^{n} \lambda_i, \tag{2.107}$$

$$|\mathbf{A}| = \prod_{i=1}^{n} \lambda_i. \tag{2.108}$$

Note that by the definition in Section 2.9, tr(A) is also equal to $\sum_{i=1}^{n} a_{ii}$, but
$a_{ii} \ne \lambda_i$.

We illustrate (2.107) and (2.108) using the matrix

$$\mathbf{A} = \begin{pmatrix} 1 & 2 \\ -1 & 4 \end{pmatrix}$$

from the illustration in Section 2.11.1, for which $\lambda_1 = 3$ and $\lambda_2 = 2$. Using (2.107),
we obtain

$$\mathrm{tr}(\mathbf{A}) = \lambda_1 + \lambda_2 = 3 + 2 = 5,$$

and from (2.108), we have

$$|\mathbf{A}| = \lambda_1\lambda_2 = 3(2) = 6.$$

By definition, we obtain

tr(A) = 1 + 4 = 5 and |A| = (1)(4) − (−1)(2) = 6.

2.11.4  Positive  Definite  and  Semidefinite  Matrices 

The eigenvalues and eigenvectors of positive definite and positive semidefinite
matrices have the following properties:

1. The eigenvalues of a positive definite matrix are all positive.

2. The eigenvalues of a positive semidefinite matrix are positive or zero, with the
number of positive eigenvalues equal to the rank of the matrix.

It is customary to list the eigenvalues of a positive definite matrix in descending
order: $\lambda_1 > \lambda_2 > \cdots > \lambda_p$. The eigenvectors $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_p$ are listed in the same
order; $\mathbf{x}_1$ corresponds to $\lambda_1$, $\mathbf{x}_2$ corresponds to $\lambda_2$, and so on.

The following result, known as the Perron-Frobenius theorem, is of interest in
Chapter 12: If all elements of the positive definite matrix A are positive, then all
elements of the first eigenvector are positive. (The first eigenvector is the one associated
with the first eigenvalue, $\lambda_1$.)

2.11.5  The  Product  AB

If  A  and  B  are  square  and  the  same  size,  the  eigenvalues  of  AB  are  the  same  as  those 
of  BA,  although  the  eigenvectors  are  usually  different.  This  result  also  holds  if  AB 
and  BA  are  both  square  but  of  different  sizes,  as  when  A  is  n  x  p  and  B  is  p  x  n.  (In 
this  case,  the  nonzero  eigenvalues  of  AB  and  BA  will  be  the  same.) 

2.11.6  Symmetric  Matrix 

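The eigenvectors of an n × n symmetric matrix A are mutually orthogonal (when an
eigenvalue is repeated, the corresponding eigenvectors can be chosen to be mutually
orthogonal). It follows that if the n eigenvectors of A are normalized and inserted as
columns of a matrix $\mathbf{C} = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)$, then C is orthogonal.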


2.11.7  Spectral  Decomposition 

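It was noted in Section 2.11.6 that if the matrix $\mathbf{C} = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)$ contains the
normalized eigenvectors of an n × n symmetric matrix A, then C is orthogonal.
Therefore, by (2.102), I = CC', which we can multiply by A to obtain

$$\mathbf{A} = \mathbf{A}\mathbf{C}\mathbf{C}'.$$

We now substitute $\mathbf{C} = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)$:

$$\begin{aligned}
\mathbf{A} &= \mathbf{A}(\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)\mathbf{C}' \\
&= (\mathbf{A}\mathbf{x}_1, \mathbf{A}\mathbf{x}_2, \ldots, \mathbf{A}\mathbf{x}_n)\mathbf{C}' && [\text{by (2.48)}] \\
&= (\lambda_1\mathbf{x}_1, \lambda_2\mathbf{x}_2, \ldots, \lambda_n\mathbf{x}_n)\mathbf{C}' && [\text{by (2.104)}] \\
&= \mathbf{C}\mathbf{D}\mathbf{C}' && [\text{by (2.56)}],
\end{aligned} \tag{2.109}$$

where

$$\mathbf{D} = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{pmatrix}. \tag{2.110}$$

The expression A = CDC' in (2.109) for a symmetric matrix A in terms of its
eigenvalues and eigenvectors is known as the spectral decomposition of A.

Since C is orthogonal and C'C = CC' = I, we can multiply (2.109) on the left
by C' and on the right by C to obtain

$$\mathbf{C}'\mathbf{A}\mathbf{C} = \mathbf{D}. \tag{2.111}$$

Thus a symmetric matrix A can be diagonalized by an orthogonal matrix containing
normalized eigenvectors of A, and by (2.110) the resulting diagonal matrix contains
eigenvalues of A.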




2.11.8  Square  Root  Matrix 

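If A is positive definite, the spectral decomposition of A in (2.109) can be modified
by taking the square roots of the eigenvalues to produce a square root matrix,

$$\mathbf{A}^{1/2} = \mathbf{C}\mathbf{D}^{1/2}\mathbf{C}', \tag{2.112}$$

where

$$\mathbf{D}^{1/2} = \begin{pmatrix} \sqrt{\lambda_1} & 0 & \cdots & 0 \\ 0 & \sqrt{\lambda_2} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \sqrt{\lambda_n} \end{pmatrix}. \tag{2.113}$$

The square root matrix $\mathbf{A}^{1/2}$ is symmetric and serves as the square root of A:

$$\mathbf{A}^{1/2}\mathbf{A}^{1/2} = (\mathbf{A}^{1/2})^2 = \mathbf{A}. \tag{2.114}$$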
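The spectral decomposition (2.109) and the square root matrix (2.112) can be sketched for a 2 × 2 symmetric matrix, where the eigenvalues have a closed form. The example uses the positive definite matrix with rows (2, −1) and (−1, 2); the helper names `sym2_eig` and `spectral_function` are assumptions made for this sketch, and the eigen routine handles only a nonzero off-diagonal element.

```python
import math


def transpose(m):
    """Transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Product of two conformable matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def sym2_eig(a):
    """Eigenpairs (descending eigenvalues, normalized eigenvectors) of a
    symmetric 2 x 2 matrix whose off-diagonal element is nonzero."""
    a11, a12, a22 = a[0][0], a[0][1], a[1][1]
    if a12 == 0 or a12 != a[1][0]:
        raise ValueError("expects a symmetric matrix with a12 != 0")
    center = (a11 + a22) / 2
    radius = math.hypot((a11 - a22) / 2, a12)
    pairs = []
    for lam in (center + radius, center - radius):
        v = (a12, lam - a11)  # solves (A - lam I)v = 0
        norm = math.hypot(v[0], v[1])
        pairs.append((lam, (v[0] / norm, v[1] / norm)))
    return pairs


def spectral_function(a, f):
    """Return C diag(f(lambda_1), f(lambda_2)) C'."""
    (l1, x1), (l2, x2) = sym2_eig(a)
    C = [[x1[0], x2[0]], [x1[1], x2[1]]]  # eigenvectors as columns
    D = [[f(l1), 0.0], [0.0, f(l2)]]
    return matmul(matmul(C, D), transpose(C))


A = [[2, -1],
     [-1, 2]]
A_rebuilt = spectral_function(A, lambda lam: lam)  # CDC', as in (2.109)
A_half = spectral_function(A, math.sqrt)           # CD^{1/2}C', as in (2.112)
A_half_squared = matmul(A_half, A_half)            # should recover A, as in (2.114)

print(A_half)
```

Passing a different function of the eigenvalues gives other functions of A in the same way, which is why `spectral_function` takes `f` as an argument.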


2.11.9  Square  Matrices  and  Inverse  Matrices 

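Other functions of A have spectral decompositions analogous to (2.112). Two of
these are the square and inverse of A. If the square matrix A has eigenvalues $\lambda_1$,
$\lambda_2, \ldots, \lambda_n$ and accompanying eigenvectors $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$, then $\mathbf{A}^2$ has eigenvalues
$\lambda_1^2, \lambda_2^2, \ldots, \lambda_n^2$ and eigenvectors $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$. If A is nonsingular, then $\mathbf{A}^{-1}$
has eigenvalues $1/\lambda_1, 1/\lambda_2, \ldots, 1/\lambda_n$ and eigenvectors $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$. If A is also
symmetric, then

$$\mathbf{A}^2 = \mathbf{C}\mathbf{D}^2\mathbf{C}', \tag{2.115}$$

$$\mathbf{A}^{-1} = \mathbf{C}\mathbf{D}^{-1}\mathbf{C}', \tag{2.116}$$

where $\mathbf{C} = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)$ has as columns the normalized eigenvectors of A (and of
$\mathbf{A}^2$ and $\mathbf{A}^{-1}$), $\mathbf{D}^2 = \mathrm{diag}(\lambda_1^2, \lambda_2^2, \ldots, \lambda_n^2)$, and $\mathbf{D}^{-1} = \mathrm{diag}(1/\lambda_1, 1/\lambda_2, \ldots, 1/\lambda_n)$.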

2.11.10  Singular  Value  Decomposition 

In (2.109) in Section 2.11.7, we expressed a symmetric matrix A in terms of its
eigenvalues and eigenvectors in the spectral decomposition A = CDC'. In a similar
manner, we can express any (real) matrix A in terms of eigenvalues and eigenvectors
of A'A and AA'. Let A be an n × p matrix of rank k. Then the singular value
decomposition of A can be expressed as

$$\mathbf{A} = \mathbf{U}\mathbf{D}\mathbf{V}', \tag{2.117}$$

where U is n × k, D is k × k, and V is p × k. The diagonal elements of the
nonsingular diagonal matrix $\mathbf{D} = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_k)$ are the positive square roots of
$\lambda_1^2, \lambda_2^2, \ldots, \lambda_k^2$, which are the nonzero eigenvalues of A'A or of AA'. The values
$\lambda_1, \lambda_2, \ldots, \lambda_k$ are called the singular values of A. The k columns of U are the
normalized eigenvectors of AA' corresponding to the eigenvalues $\lambda_1^2, \lambda_2^2, \ldots, \lambda_k^2$. The k
columns of V are the normalized eigenvectors of A'A corresponding to the
eigenvalues $\lambda_1^2, \lambda_2^2, \ldots, \lambda_k^2$. Since the columns of U and of V are (normalized) eigenvectors
of symmetric matrices, they are mutually orthogonal (see Section 2.11.6), and we
have U'U = V'V = I.
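A small singular value decomposition can be assembled from the eigenpairs of A'A. This is a sketch under stated assumptions: the 3 × 2 matrix A is hypothetical and has rank k = 2, and the columns of U are obtained as $\mathbf{A}\mathbf{v}_j/\lambda_j$. Such columns are normalized eigenvectors of AA', and computing them this way keeps the signs of U and V consistent. The helper names are illustrative.

```python
import math


def transpose(m):
    """Transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Product of two conformable matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def sym2_eig(a):
    """Eigenpairs (descending) of a symmetric 2 x 2 matrix with a12 != 0."""
    a11, a12, a22 = a[0][0], a[0][1], a[1][1]
    if a12 == 0 or a12 != a[1][0]:
        raise ValueError("expects a symmetric matrix with a12 != 0")
    center = (a11 + a22) / 2
    radius = math.hypot((a11 - a22) / 2, a12)
    pairs = []
    for lam in (center + radius, center - radius):
        v = (a12, lam - a11)  # solves (A - lam I)v = 0
        norm = math.hypot(v[0], v[1])
        pairs.append((lam, (v[0] / norm, v[1] / norm)))
    return pairs


A = [[1, 3],
     [2, -1],
     [4, 6]]  # hypothetical 3 x 2 matrix of rank 2
n, p = len(A), len(A[0])

pairs = sym2_eig(matmul(transpose(A), A))  # eigenpairs of A'A
singular_values = [math.sqrt(lam) for lam, _ in pairs]

# V: normalized eigenvectors of A'A as columns (p x k).
V = [[pairs[j][1][i] for j in range(p)] for i in range(p)]
D = [[singular_values[0], 0.0], [0.0, singular_values[1]]]

# U: column j is A v_j / lambda_j, a normalized eigenvector of AA' (n x k).
AV = matmul(A, V)
U = [[AV[i][j] / singular_values[j] for j in range(p)] for i in range(n)]

A_rebuilt = matmul(matmul(U, D), transpose(V))  # UDV', as in (2.117)
UtU = matmul(transpose(U), U)
VtV = matmul(transpose(V), V)

print([round(s, 4) for s in singular_values])
```

For larger matrices a numerical library routine would normally be used instead; the closed-form 2 × 2 eigen step here only works because A'A is 2 × 2.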

PROBLEMS 
2.1  Let 


A  = 


4  2  3  \ 

7  5  8  /’ 


(a)  Find  A  +  B  and  A  —  B. 

(b)  Find  A'A  and  AA'. 


4 

-5 


2.2  Use  the  matrices  A  and  B  in  Problem  2.1: 

(a)  Find  (A  +  B)'  and  A'  +  B'  and  compare  them,  thus  illustrating  (2.15). 

(b)  Show  that  (A')'  =  A,  thus  illustrating  (2.6). 


2.3  Let 


(a)  Find  AB  and  BA. 

(b)  Find  |AB|,  |A|,  and  |B|  and  verify  that  (2.89)  holds  in  this  case. 

2.4  Use  the  matrices  A  and  B  in  Problem  2.3: 

(a)  Find  A  +  B  and  tr(A  +  B). 

(b)  Find  tr(A)  and  tr(B)  and  show  that  (2.96)  holds  for  these  matrices. 

2.5  Let 


(a)  Find  AB  and  BA. 

(b)  Compare  tr(AB)  and  tr(BA)  and  confirm  that  (2.97)  holds  here. 


(1  2 

2  4 

\  5  10 


-2 

-2 

2 


2.6  Let

(a)  Show  that  AB  =  O.

(b)  Find  a  vector  x  such  that  Ax  =  0.

(c)  Show  that  |A|  =  0.

2.7  Let


(  1  -1 
A  =  -1  1 

V  4  3 


Find the following:

(a) Bx        (d) x'Ay      (g) xx'
(b) y'B       (e) x'x       (h) xy'
(c) x'Ax      (f) x'y       (i) B'B

2.8  Use  x,  y,  and  A  as  defined  in  Problem  2.7: 

(a)  Find  x  +  y  and  x  —  y. 

(b)  Find  (x  -  y)'A(x  -  y). 

2.9  Using  B  and  x  in  Problem  2.7,  find  Bx  as  a  linear  combination  of  columns  of 
B  as  in  (2.67)  and  compare  with  Bx  found  in  Problem  2.7(a). 

2.10  Let 


A  = 


2 

1 


B  = 


1  4  2  \ 

5  0  3  /’ 


0 

1 


(a)  Show  that  (AB)'  =  B'A'  as  in  (2.27). 

(b)  Show  that  AI  =  A  and  that  IB  =  B. 

(c)  Find  |A|. 

2.11  Let 


a  = 


2 

1 

3 


(a)  Find  a'b  and  (a'b)2. 

(b)  Find  bb'  and  a'(bb')a. 

(c) Compare $(\mathbf{a}'\mathbf{b})^2$ with a'(bb')a and thus illustrate (2.40).

2.12  Let

$$\mathbf{A} = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}, \qquad \mathbf{D} = \begin{pmatrix} a & 0 & 0 \\ 0 & b & 0 \\ 0 & 0 & c \end{pmatrix}.$$

Find DA, AD, and DAD.

2.13  Let  the  matrices  A  and  B  be  partitioned  as  follows: 


/  2  1 

2  \ 

/ill 

0 

3  2 

0 

,  B  = 

2  1  1 

2 

l  1  0 

1  J 

\  2  3  1 

2) 

(a)  Find  AB  as  in  (2.65)  using  the  indicated  partitioning. 

(b)  Check  by  finding  AB  in  the  usual  way,  ignoring  the  partitioning. 

2.14  Let 


A  = 


1  3 

2  0 


1 

-4 


Find  AB  and  CB.  Are  they  equal?  What  is  the  rank  of  A,  B,  and  C? 

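2.15  Let

$$\mathbf{A} = \begin{pmatrix} 5 & 4 & 4 \\ 2 & -3 & 1 \\ 3 & 7 & 2 \end{pmatrix}, \qquad \mathbf{B} = \begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 2 & 3 \end{pmatrix}.$$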

(a)  Find  tr(A)  and  tr(B). 

(b)  Find  A  +  B  and  tr(A  +  B).  Is  tr(A  +  B)  =  tr(A)  +  tr(B)? 


(c)  Find  |A|  and  |B|. 

(d)  Find  AB  and  |AB|.  Is  |AB|  =  |A||B|? 


2.16  Let

$$\mathbf{A} = \begin{pmatrix} 3 & 4 & 3 \\ 4 & 8 & 6 \\ 3 & 6 & 9 \end{pmatrix}.$$

(a) Show that |A| > 0.

(b) Using the Cholesky decomposition in Section 2.7, find an upper triangular
matrix T such that A = T'T.

2.17  Let


(a)  Show  that  |A|  >  0. 

(b)  Using  the  Cholesky  decomposition  in  Section  2.7,  find  an  upper  triangular 
matrix  T  such  that  A  =  T'T. 

2.18  The  columns  of  the  following  matrix  are  mutually  orthogonal: 


(a)  Normalize  the  columns  of  A  by  dividing  each  column  by  its  length;  denote 
the  resulting  matrix  by  C. 

(b)  Show  that  C  is  an  orthogonal  matrix,  that  is,  C'C  =  CC'  =  I. 

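2.19  Let

$$\mathbf{A} = \begin{pmatrix} 1 & 1 & -2 \\ -1 & 2 & 1 \\ 0 & 1 & -1 \end{pmatrix}.$$

(a) Find the eigenvalues and associated normalized eigenvectors.

(b) Find tr(A) and |A| and show that $\mathrm{tr}(\mathbf{A}) = \sum_{i=1}^{3} \lambda_i$ and $|\mathbf{A}| = \prod_{i=1}^{3} \lambda_i$.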

2.20  Let

$$\mathbf{A} = \begin{pmatrix} 3 & 1 & 1 \\ 1 & 0 & 2 \\ 1 & 2 & 0 \end{pmatrix}.$$

(a) The eigenvalues of A are 1, 4, −2. Find the normalized eigenvectors and
use them as columns in an orthogonal matrix C.

(b) Show that C'AC = D as in (2.111), where D is diagonal with the eigenvalues
of A on the diagonal.

(c) Show that A = CDC' as in (2.109).

2.21  For the positive definite matrix

$$\mathbf{A} = \begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix},$$

calculate the eigenvalues and eigenvectors and find the square root matrix $\mathbf{A}^{1/2}$
as in (2.112). Check by showing that $(\mathbf{A}^{1/2})^2 = \mathbf{A}$.

2.22  Let

$$\mathbf{A} = \begin{pmatrix} 3 & 6 & -1 \\ 6 & 9 & 4 \\ -1 & 4 & 3 \end{pmatrix}.$$

(a) Find the spectral decomposition of A as in (2.109).

(b) Find the spectral decomposition of $\mathbf{A}^2$ and show that the diagonal matrix
of eigenvalues is equal to the square of the matrix D found in part (a), thus
illustrating (2.115).

(c) Find the spectral decomposition of $\mathbf{A}^{-1}$ and show that the diagonal matrix
of eigenvalues is equal to the inverse of the matrix D found in part (a), thus
illustrating (2.116).

2.23  Find the singular value decomposition of A as in (2.117), where

$$\mathbf{A} = \begin{pmatrix} 4 & -5 & -1 \\ 7 & -2 & 3 \\ -1 & 4 & -3 \\ 8 & 2 & 6 \end{pmatrix}.$$

2.24  If j is a vector of 1's, as defined in (2.11), show that the following hold:

(a) $\mathbf{j}'\mathbf{a} = \mathbf{a}'\mathbf{j} = \sum_i a_i$ as in (2.37).

(b) j'A is a row vector whose elements are the column sums of A as in (2.38).

(c) Aj is a column vector whose elements are the row sums of A as in (2.38).

2.25  Verify  (2.41);  that  is,  show  that  (x  —  y)'(x  —  y)  =  x'x  —  2x'y  +  y'y. 

2.26  Show  that  A' A  is  symmetric,  where  A  is  n  x  p. 

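2.27  If a and $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$ are all p × 1 and A is p × p, show that (2.42)-(2.45)
hold:

(a) $\sum_{i=1}^{n} \mathbf{a}'\mathbf{x}_i = \mathbf{a}'\sum_{i=1}^{n} \mathbf{x}_i$.

(b) $\sum_{i=1}^{n} \mathbf{A}\mathbf{x}_i = \mathbf{A}\sum_{i=1}^{n} \mathbf{x}_i$.

(c) $\sum_{i=1}^{n} (\mathbf{a}'\mathbf{x}_i)^2 = \mathbf{a}'\left(\sum_{i=1}^{n} \mathbf{x}_i\mathbf{x}_i'\right)\mathbf{a}$.

(d) $\sum_{i=1}^{n} \mathbf{A}\mathbf{x}_i(\mathbf{A}\mathbf{x}_i)' = \mathbf{A}\left(\sum_{i=1}^{n} \mathbf{x}_i\mathbf{x}_i'\right)\mathbf{A}'$.

2.28  Assume that $\mathbf{A} = \begin{pmatrix} \mathbf{a}_1' \\ \mathbf{a}_2' \end{pmatrix}$ is 2 × p, x is p × 1, and S is p × p.

(a) Show that

$$\mathbf{A}\mathbf{x} = \begin{pmatrix} \mathbf{a}_1'\mathbf{x} \\ \mathbf{a}_2'\mathbf{x} \end{pmatrix},$$

as in (2.49).

(b) Show that

$$\mathbf{A}\mathbf{S}\mathbf{A}' = \begin{pmatrix} \mathbf{a}_1'\mathbf{S}\mathbf{a}_1 & \mathbf{a}_1'\mathbf{S}\mathbf{a}_2 \\ \mathbf{a}_2'\mathbf{S}\mathbf{a}_1 & \mathbf{a}_2'\mathbf{S}\mathbf{a}_2 \end{pmatrix},$$

as in (2.50).

2.29  (a) If the rows of A are denoted by $\mathbf{a}_i'$, show that $\mathbf{A}'\mathbf{A} = \sum_{i=1}^{n} \mathbf{a}_i\mathbf{a}_i'$ as in
(2.51).

(b) If the columns of A are denoted by $\mathbf{a}_{(j)}$, show that $\mathbf{A}\mathbf{A}' = \sum_{j=1}^{p} \mathbf{a}_{(j)}\mathbf{a}_{(j)}'$
as in (2.53).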

2.30  Show that $(\mathbf{A}')^{-1} = (\mathbf{A}^{-1})'$ as in (2.75).

2.31  Show that the inverse of the partitioned matrix given in (2.76) is correct by
multiplying by

$$\begin{pmatrix} \mathbf{A}_{11} & \mathbf{a}_{12} \\ \mathbf{a}_{12}' & a_{22} \end{pmatrix}$$

to obtain an identity.

2.32  Show that the inverse of B + cc' given in (2.77) is correct by multiplying by
B + cc' to obtain an identity.

2.33  Show that $|c\mathbf{A}| = c^n|\mathbf{A}|$ as in (2.85).

2.34  Show that $|\mathbf{A}^{-1}| = 1/|\mathbf{A}|$ as in (2.91).

2.35  If B is nonsingular and c is a vector, show that $|\mathbf{B} + \mathbf{c}\mathbf{c}'| = |\mathbf{B}|(1 + \mathbf{c}'\mathbf{B}^{-1}\mathbf{c})$ as
in (2.95).

2.36  Show that $\mathrm{tr}(\mathbf{A}'\mathbf{A}) = \mathrm{tr}(\mathbf{A}\mathbf{A}') = \sum_{ij} a_{ij}^2$ as in (2.98).

2.37  Show that CC' = I in (2.102) follows from C'C = I in (2.101).

2.38  Show that the eigenvalues of AB are the same as those of BA, as noted in
Section 2.11.5.

2.39  If $\mathbf{A}^{1/2}$ is the square root matrix defined in (2.112), show that

(a) $(\mathbf{A}^{1/2})^2 = \mathbf{A}$ as in (2.114),

(b) $|\mathbf{A}^{1/2}|^2 = |\mathbf{A}|$,

(c) $|\mathbf{A}^{1/2}| = |\mathbf{A}|^{1/2}$.


CHAPTER  3 


Characterizing  and  Displaying 
Multivariate  Data 


We  review  some  univariate  and  bivariate  procedures  in  Sections  3.1,  3.2,  and  3.3  and 
then  extend  them  to  vectors  of  higher  dimension  in  the  remainder  of  the  chapter. 


3.1  MEAN  AND  VARIANCE  OF  A  UNIVARIATE  RANDOM  VARIABLE 

Informally,  a  random  variable  may  be  defined  as  a  variable  whose  value  depends  on 
the  outcome  of  a  chance  experiment.  Generally,  we  will  consider  only  continuous 
random  variables.  Some  types  of  multivariate  data  are  only  approximations  to  this 
ideal, such as test scores or a seven-point semantic differential (Likert) scale consisting
of ordered responses ranging from strongly disagree to strongly agree. Special
techniques  have  been  developed  for  such  data,  but  in  many  cases,  the  usual  methods 
designed  for  continuous  data  work  almost  as  well. 

The density function f(y) indicates the relative frequency of occurrence of the
random variable y. (We do not use Y to denote the random variable for reasons
given at the beginning of Section 3.5.) Thus, if $f(y_1) > f(y_2)$, then points in the
neighborhood of $y_1$ are more likely to occur than points in the neighborhood of $y_2$.
The population mean of a random variable y is defined (informally) as the mean
of all possible values of y and is denoted by μ. The mean is also referred to as the
expected value of y, or E(y). If the density f(y) is known, the mean can sometimes
be found using methods of calculus, but we will not use these techniques in this text.

If f(y) is unknown, the population mean μ will ordinarily remain unknown unless
it has been established from extensive past experience with a stable population. If
a large random sample from the population represented by f(y) is available, it is
highly probable that the mean of the sample is close to μ.

The sample mean of a random sample of n observations $y_1, y_2, \ldots, y_n$ is given
by the ordinary arithmetic average

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i. \tag{3.1}$$

Generally, $\bar{y}$ will never be equal to μ; by this we mean that the probability is zero that
a sample will ever arise in which $\bar{y}$ is exactly equal to μ. However, $\bar{y}$ is considered a
good estimator for μ because $E(\bar{y}) = \mu$ and $\mathrm{var}(\bar{y}) = \sigma^2/n$, where $\sigma^2$ is the
variance of y. In other words, $\bar{y}$ is an unbiased estimator of μ and has a smaller variance
than a single observation y. The variance $\sigma^2$ is defined shortly. The notation $E(\bar{y})$
indicates the mean of all possible values of $\bar{y}$; that is, conceptually, every possible
sample is obtained from the population, the mean of each is found, and the average
of all these sample means is calculated.

If every y in the population is multiplied by a constant a, the expected value is
also multiplied by a:

$$E(ay) = aE(y) = a\mu. \tag{3.2}$$

The sample mean has a similar property. If $z_i = ay_i$ for i = 1, 2, ..., n, then

$$\bar{z} = a\bar{y}. \tag{3.3}$$

The variance of the population is defined as $\mathrm{var}(y) = \sigma^2 = E(y - \mu)^2$.
This is the average squared deviation from the mean and is thus an indication of
the extent to which the values of y are spread or scattered. It can be shown that
$\sigma^2 = E(y^2) - \mu^2$.

The sample variance is defined as

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}, \tag{3.4}$$

which can be shown to be equal to

$$s^2 = \frac{\sum_{i=1}^{n} y_i^2 - n\bar{y}^2}{n - 1}. \tag{3.5}$$

The sample variance $s^2$ is generally never equal to the population variance $\sigma^2$ (the
probability of such an occurrence is zero), but it is an unbiased estimator for $\sigma^2$; that
is, $E(s^2) = \sigma^2$. Again the notation $E(s^2)$ indicates the mean of all possible sample
variances. The square root of either the population variance or sample variance is
called the standard deviation.
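The equivalence of (3.4) and (3.5) and the scaling property (3.3) of the sample mean can be checked on a small sample. The data below are hypothetical, and exact fractions are used so that the two variance formulas can be compared without rounding error. The function names are illustrative.

```python
from fractions import Fraction


def sample_mean(y):
    """Arithmetic average of the observations, as in (3.1)."""
    return Fraction(sum(y), len(y))


def sample_variance(y):
    """Sum of squared deviations from the mean over n - 1, as in (3.4)."""
    n, ybar = len(y), sample_mean(y)
    return sum((yi - ybar) ** 2 for yi in y) / (n - 1)


def sample_variance_shortcut(y):
    """Computational form (sum of y_i^2 - n * ybar^2) / (n - 1), as in (3.5)."""
    n, ybar = len(y), sample_mean(y)
    return (sum(yi * yi for yi in y) - n * ybar ** 2) / (n - 1)


y = [3, 5, 6, 10]          # hypothetical sample
a = 2                      # hypothetical constant
z = [a * yi for yi in y]   # z_i = a * y_i

print(sample_mean(y), sample_variance(y), sample_variance_shortcut(y))
print(sample_mean(z) == a * sample_mean(y))  # property (3.3)
```

In floating-point arithmetic the shortcut form (3.5) can lose accuracy when the mean is large relative to the spread, so the definitional form (3.4) is usually the safer choice in code.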

If each y is multiplied by a constant a, the population variance is multiplied by
$a^2$, that is, $\mathrm{var}(ay) = a^2\sigma^2$. Similarly, if $z_i = ay_i$, i = 1, 2, ..., n, then the sample
variance of z is given by

$$s_z^2 = a^2s^2. \tag{3.6}$$

3.2  COVARIANCE  AND  CORRELATION  OF  BIVARIATE
RANDOM  VARIABLES

3.2.1  Covariance 

If  two  variables  x  and  y  are  measured  on  each  research  unit  (object  or  subject),  we 
have  a  bivariate  random  variable  (x,  y).  Often  x  and  y  will  tend  to  covary;  if  one 
is  above  its  mean,  the  other  is  more  likely  to  be  above  its  mean,  and  vice  versa.  For 
example,  height  and  weight  were  observed  for  a  sample  of  20  college-age  males.  The 
data  are  given  in  Table  3.1. 

The values of height x and weight y from Table 3.1 are both plotted in the vertical
direction in Figure 3.1. The tendency for x and y to stay on the same side of the mean
is clear in Figure 3.1. This illustrates positive covariance. With negative covariance
the points would tend to deviate simultaneously to opposite sides of the mean.

Table 3.1.  Height and Weight for a Sample of 20 College-age Males

Person   Height x   Weight y      Person   Height x   Weight y
   1        69        153           11        72        140
   2        74        175           12        79        265
   3        68        155           13        74        185
   4        70        135           14        67        112
   5        72        172           15        66        140
   6        67        150           16        71        150
   7        66        115           17        74        165
   8        70        137           18        75        185
   9        76        200           19        75        210
  10        68        130           20        76        220

Figure 3.1.  Two variables with a tendency to covary.

The population covariance is defined as $\mathrm{cov}(x, y) = \sigma_{xy} = E[(x - \mu_x)(y - \mu_y)]$,
where $\mu_x$ and $\mu_y$ are the means of x and y, respectively. Thus if x and y are
usually both above their means or both below their means, the product
$(x - \mu_x)(y - \mu_y)$ will typically be positive, and the average value of the product will be positive.
Conversely, if x and y tend to fall on opposite sides of their respective means, the
product will usually be negative and the average product will be negative. It can be
shown that $\sigma_{xy} = E(xy) - \mu_x\mu_y$.

If  the  two  random  variables  x  and  y  in  a  bivariate  random  variable  are  added  or 
multiplied,  a  new  random  variable  is  obtained.  The  mean  of  x  +  y  or  of  xy  is  as 
follows: 


$$ E(x + y) = E(x) + E(y), \tag{3.7} $$

$$ E(xy) = E(x)E(y) \quad \text{if } x \text{ and } y \text{ are independent.} \tag{3.8} $$

Formally,  x  and  y  are  independent  if  their  joint  density  factors  into  the  product  of 
their  individual  densities:  f  (x,y)  =  g(x)h(y).  Informally,  x  and  y  are  independent 
if  the  random  behavior  of  either  of  the  variables  is  not  affected  by  the  behavior  of  the 
other.  Note  that  (3.7)  is  true  whether  or  not  x  and  y  are  independent,  but  (3.8)  holds 
only  for  x  and  y  independently  distributed. 
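As a small check of (3.7) and (3.8), the sketch below builds a discrete joint distribution that factors into the product of its marginals and compares the expected values exactly. The marginal probabilities `px` and `py` and the helper `expect` are hypothetical choices made for illustration; they do not come from the text.

```python
from fractions import Fraction
from itertools import product

# Hypothetical marginal distributions (illustrative values, not from the text).
px = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}
py = {-1: Fraction(1, 3), 1: Fraction(2, 3)}

# Independence: the joint probability factors as f(x, y) = g(x) h(y).
joint = {(x, y): px[x] * py[y] for x, y in product(px, py)}


def expect(func, pmf):
    """Expected value of func(outcome) under a discrete pmf."""
    return sum(func(outcome) * prob for outcome, prob in pmf.items())


e_x = expect(lambda x: x, px)
e_y = expect(lambda y: y, py)
e_sum = expect(lambda xy: xy[0] + xy[1], joint)
e_prod = expect(lambda xy: xy[0] * xy[1], joint)

print(e_sum == e_x + e_y)    # checks (3.7)
print(e_prod == e_x * e_y)   # checks (3.8); relies on the factored joint pmf
```

Exact fractions are used so that the two sides can be compared with `==` rather than with a floating-point tolerance.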

The notion of independence of x and y is more general than that of zero covariance.
The covariance $\sigma_{xy}$ measures linear relationship only, whereas if two random
variables are independent, they are not related either linearly or nonlinearly.
Independence implies $\sigma_{xy} = 0$, but $\sigma_{xy} = 0$ does not imply independence. It is easy to
show that if x and y are independent, then $\sigma_{xy} = 0$:

$$
\begin{aligned}
\sigma_{xy} &= E(xy) - \mu_x\mu_y \\
&= E(x)E(y) - \mu_x\mu_y \quad [\text{by (3.8)}] \\
&= \mu_x\mu_y - \mu_x\mu_y = 0.
\end{aligned}
$$

One  way  to  demonstrate  that  the  converse  is  not  true  is  to  construct  examples  of 
bivariate  x  and  y  that  have  zero  covariance  and  yet  are  related  in  a  nonlinear  way 
(the  relationship  will  have  zero  slope).  This  is  illustrated  in  Figure  3.2. 
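A minimal sketch of such a construction: the x values below are chosen symmetrically about zero and y is set to the square of x, so y is completely determined by x and yet the sample covariance is zero. The data and the helper `sample_cov` are hypothetical; the values are picked by hand and are not a random sample.

```python
def sample_cov(x, y):
    """Sample covariance with divisor n - 1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)


# Hypothetical values: x symmetric about 0, y an exact nonlinear function of x.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]

s_xy = sample_cov(x, y)
print(s_xy)
```

The positive products on the right side of the parabola cancel the negative products on the left, which is why a measure of linear association reports nothing here.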

If x and y have a bivariate normal distribution (see Chapter 4), then zero covariance
implies independence. This is because (1) the covariance measures only linear
relationships and (2) in the bivariate normal case, the mean of y given x (or x given
y) is a straight line.

The sample covariance is defined as

$$ s_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n - 1}. \tag{3.9} $$


Figure  3.2.  A  sample  from  a  population  where  x  and  y  have  zero  covariance  and  yet  are 
dependent. 


It can be shown that

$$ s_{xy} = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{n - 1}. \tag{3.10} $$


Note that $s_{xy}$ is essentially never equal to $\sigma_{xy}$ (for continuous data); that is, the
probability is zero that $s_{xy}$ will equal $\sigma_{xy}$. It is true, however, that $s_{xy}$ is an unbiased
estimator for $\sigma_{xy}$, that is, $E(s_{xy}) = \sigma_{xy}$.

Since $s_{xy} \neq \sigma_{xy}$ in any given sample, this is also true when $\sigma_{xy} = 0$. Thus when
the population covariance is zero, no random sample from the population will have
zero covariance. The only way a sample from a continuous bivariate distribution will
have zero covariance is for the experimenter to choose the values of x and y so that
$s_{xy} = 0$. (Such a sample would not be a random sample.) One way to achieve this is
to place the values in the form of a grid. This is illustrated in Figure 3.3.
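The grid construction can be sketched as follows, assuming a full grid in which every chosen x level is paired with every chosen y level. The levels below are hypothetical and are not the values plotted in Figure 3.3.

```python
from itertools import product


def sample_cov(x, y):
    """Sample covariance with divisor n - 1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)


# Hypothetical grid levels chosen by the experimenter (not a random sample).
x_levels = [1, 2, 3, 4]
y_levels = [10, 20, 30]

# Every x level is paired with every y level.
pairs = list(product(x_levels, y_levels))
x = [pair[0] for pair in pairs]
y = [pair[1] for pair in pairs]

s_xy = sample_cov(x, y)
print(len(pairs), s_xy)
```

On a full grid the sum of cross products factors into the sum of the x deviations times the sum of the y deviations, and each of those sums is zero.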

The sample covariance measures only linear relationships. If the points in a bivariate
sample follow a curved trend, as, for example, in Figure 3.2, the sample covariance
will not measure the strength of the relationship. To see that $s_{xy}$ measures only
linear relationships, note that the slope of a simple linear regression line is

$$ \hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{s_{xy}}{s_x^2}. \tag{3.11} $$

Thus $s_{xy}$ is proportional to the slope, which shows only the linear relationship
between y and x.

Variables with zero sample covariance can be said to be orthogonal. By (2.99),
two sets of numbers $a_1, a_2, \ldots, a_n$ and $b_1, b_2, \ldots, b_n$ are orthogonal if $\sum_{i=1}^{n} a_i b_i = 0$.
This is true for the centered variables $x_i - \bar{x}$ and $y_i - \bar{y}$ when the sample covariance
is zero, that is, $\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = 0$.

Figure 3.3. A sample of (x, y) values with zero covariance.


Example 3.2.1. To obtain the sample covariance for the height and weight data in
Table 3.1, we first calculate $\bar{x}$, $\bar{y}$, and $\sum_i x_i y_i$, where x is height and y is weight:

$$ \bar{x} = \frac{69 + 74 + \cdots + 76}{20} = 71.45, $$

$$ \bar{y} = \frac{153 + 175 + \cdots + 220}{20} = 164.7, $$

$$ \sum_{i=1}^{20} x_i y_i = (69)(153) + (74)(175) + \cdots + (76)(220) = 237{,}805. $$

Now, by (3.10), we have

$$ s_{xy} = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{n - 1} = \frac{237{,}805 - (20)(71.45)(164.7)}{19} = 128.88. $$


By  itself,  the  sample  covariance  128.88  is  not  very  meaningful.  We  are  not  sure  if 
this  represents  a  small,  moderate,  or  large  amount  of  relationship  between  y  and  x. 
A  method  of  standardizing  the  covariance  is  given  in  the  next  section.  □ 
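The arithmetic of Example 3.2.1 can be reproduced with a short script. This is a sketch using only the Table 3.1 values; the variable names are illustrative. It evaluates both the definitional form (3.9) and the computational form (3.10) so the two can be compared.

```python
# Height (x) and weight (y) for the 20 males in Table 3.1.
height = [69, 74, 68, 70, 72, 67, 66, 70, 76, 68,
          72, 79, 74, 67, 66, 71, 74, 75, 75, 76]
weight = [153, 175, 155, 135, 172, 150, 115, 137, 200, 130,
          140, 265, 185, 112, 140, 150, 165, 185, 210, 220]

n = len(height)
x_bar = sum(height) / n
y_bar = sum(weight) / n
sum_xy = sum(x * y for x, y in zip(height, weight))

# Computational form (3.10).
s_xy_short = (sum_xy - n * x_bar * y_bar) / (n - 1)

# Definitional form (3.9).
s_xy_def = sum((x - x_bar) * (y - y_bar)
               for x, y in zip(height, weight)) / (n - 1)

print(x_bar, y_bar, sum_xy)
print(round(s_xy_short, 2), round(s_xy_def, 2))
```

The two forms agree up to floating-point rounding; the definitional form is generally the safer choice numerically when the means are large relative to the spread.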
3.2.2  Correlation

Since the covariance depends on the scale of measurement of x and y, it is difficult to
compare covariances between different pairs of variables. For example, if we change
a measurement from inches to centimeters, the covariance will change. To find a
measure of linear relationship that is invariant to changes of scale, we can standardize
the covariance by dividing by the standard deviations of the two variables. This
standardized covariance is called a correlation. The population correlation of two
random variables x and y is

$$ \rho_{xy} = \mathrm{corr}(x, y) = \frac{\sigma_{xy}}{\sigma_x\sigma_y} = \frac{E[(x - \mu_x)(y - \mu_y)]}{\sqrt{E(x - \mu_x)^2}\,\sqrt{E(y - \mu_y)^2}}, \tag{3.12} $$

and the sample correlation is

$$ r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}. \tag{3.13} $$

Either of these correlations will range between $-1$ and 1.

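The sample correlation $r_{xy}$ is related to the cosine of the angle between two vectors.
Let $\theta$ be the angle between vectors $\mathbf{a}$ and $\mathbf{b}$ in Figure 3.4. The vector from the
terminal point of $\mathbf{a}$ to the terminal point of $\mathbf{b}$ can be represented as $\mathbf{c} = \mathbf{b} - \mathbf{a}$. Then
the law of cosines can be stated in vector form as

$$
\begin{aligned}
\cos\theta &= \frac{\mathbf{a}'\mathbf{a} + \mathbf{b}'\mathbf{b} - (\mathbf{b} - \mathbf{a})'(\mathbf{b} - \mathbf{a})}{2\sqrt{(\mathbf{a}'\mathbf{a})(\mathbf{b}'\mathbf{b})}} \\
&= \frac{\mathbf{a}'\mathbf{a} + \mathbf{b}'\mathbf{b} - (\mathbf{b}'\mathbf{b} + \mathbf{a}'\mathbf{a} - 2\mathbf{a}'\mathbf{b})}{2\sqrt{(\mathbf{a}'\mathbf{a})(\mathbf{b}'\mathbf{b})}} \\
&= \frac{\mathbf{a}'\mathbf{b}}{\sqrt{(\mathbf{a}'\mathbf{a})(\mathbf{b}'\mathbf{b})}}.
\end{aligned}
\tag{3.14}
$$

Figure 3.4. Vectors a and b in 3-space.

Since $\cos(90°) = 0$, we see from (3.14) that $\mathbf{a}'\mathbf{b} = 0$ when $\theta = 90°$. Thus $\mathbf{a}$
and $\mathbf{b}$ are perpendicular when $\mathbf{a}'\mathbf{b} = 0$. By (2.99), two vectors $\mathbf{a}$ and $\mathbf{b}$, such that
$\mathbf{a}'\mathbf{b} = 0$, are also said to be orthogonal. Hence orthogonal vectors are perpendicular
in a geometric sense.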

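To express the correlation in the form given in (3.14), let the n observation vectors
$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ in two dimensions be represented as two vectors
$\mathbf{x}' = (x_1, x_2, \ldots, x_n)$ and $\mathbf{y}' = (y_1, y_2, \ldots, y_n)$ in n dimensions, and let $\mathbf{x}$ and $\mathbf{y}$
be centered as $\mathbf{x} - \bar{x}\mathbf{j}$ and $\mathbf{y} - \bar{y}\mathbf{j}$. Then the cosine of the angle $\theta$ between them [see
(3.14)] is equal to the sample correlation between x and y:

$$
\cos\theta = \frac{(\mathbf{x} - \bar{x}\mathbf{j})'(\mathbf{y} - \bar{y}\mathbf{j})}{\sqrt{[(\mathbf{x} - \bar{x}\mathbf{j})'(\mathbf{x} - \bar{x}\mathbf{j})][(\mathbf{y} - \bar{y}\mathbf{j})'(\mathbf{y} - \bar{y}\mathbf{j})]}}
= \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}} = r_{xy}.
\tag{3.15}
$$

Thus if the angle $\theta$ between the two centered vectors $\mathbf{x} - \bar{x}\mathbf{j}$ and $\mathbf{y} - \bar{y}\mathbf{j}$ is small so
that $\cos\theta$ is near 1, $r_{xy}$ will be close to 1. If the two vectors are perpendicular, $\cos\theta$
and $r_{xy}$ will be zero. If the two vectors have nearly opposite directions, $r_{xy}$ will be
close to $-1$.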


Example 3.2.2. To obtain the correlation for the height and weight data of Table 3.1,
we first calculate the sample variance of x:

$$ s_x^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1} = \frac{102{,}379 - (20)(71.45)^2}{19} = 14.576. $$

Then $s_x = \sqrt{14.576} = 3.8179$ and, similarly, $s_y = 37.964$. By (3.13), we have

$$ r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{128.88}{(3.8179)(37.964)} = .889. $$

□
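The following sketch recomputes the correlation of Example 3.2.2 from the Table 3.1 data in two ways: as the standardized covariance in (3.13) and as the cosine of the angle between the centered vectors in (3.15). The helper names `centered` and `dot` are illustrative.

```python
import math

# Height (x) and weight (y) for the 20 males in Table 3.1.
height = [69, 74, 68, 70, 72, 67, 66, 70, 76, 68,
          72, 79, 74, 67, 66, 71, 74, 75, 75, 76]
weight = [153, 175, 155, 135, 172, 150, 115, 137, 200, 130,
          140, 265, 185, 112, 140, 150, 165, 185, 210, 220]


def centered(values):
    """Subtract the mean from every element."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def dot(a, b):
    """Inner product a'b of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


n = len(height)
xc = centered(height)
yc = centered(weight)

s_x = math.sqrt(dot(xc, xc) / (n - 1))
s_y = math.sqrt(dot(yc, yc) / (n - 1))
s_xy = dot(xc, yc) / (n - 1)

r_xy = s_xy / (s_x * s_y)                                      # (3.13)
cos_theta = dot(xc, yc) / math.sqrt(dot(xc, xc) * dot(yc, yc))  # (3.15)

print(round(s_x, 4), round(s_y, 3), round(r_xy, 3), round(cos_theta, 3))
```

The factors of n - 1 cancel in (3.13), which is why the correlation can be read directly from the geometry of the centered vectors.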


3.3  SCATTER  PLOTS  OF  BIVARIATE  SAMPLES 

Figures 3.2 and 3.3 are examples of scatter plots of bivariate samples. In Figure 3.1,
the two variables x and y were plotted separately for the data in Table 3.1. Figure 3.5
shows a bivariate scatter plot of the same data.

Figure 3.5. Bivariate scatter plot of the data in Figure 3.1.

If the origin is shifted to $(\bar{x}, \bar{y})$, as indicated by the dashed lines, then the first and
third quadrants contain most of the points. Scatter plots for correlated data typically
show a substantial positive or negative slope.

A  hypothetical  sample  of  the  uncorrelated  variables  height  and  IQ  is  shown  in 
Figure  3.6.  We  could  change  the  shape  of  the  swarm  of  points  by  altering  the  scale 
on  either  axis.  But  because  of  the  independence  assumed  for  these  variables,  each 
quadrant  is  likely  to  have  as  many  points  as  any  other  quadrant.  A  tall  person  is  as 
likely  to  have  a  high  IQ  as  a  low  IQ.  A  person  of  low  IQ  is  as  likely  to  be  short  as  to 
be  tall. 
Figure  3.6.  A  sample  of  data  from  a  population  where  x  and  y  are  uncorrelated.

3.4  GRAPHICAL  DISPLAYS  FOR  MULTIVARIATE  SAMPLES

It  is  a  relatively  simple  procedure  to  plot  bivariate  samples  as  in  Section  3.3.  The 
position  of  a  point  shows  at  once  the  value  of  both  variables.  However,  for  three  or 
more  variables  it  is  a  challenge  to  show  graphically  the  values  of  all  the  variables 
in  an  observation  vector  y.  On  a  two-dimensional  plot,  the  value  of  a  third  variable 
could  be  indicated  by  color  or  intensity  or  size  of  the  plotted  point.  Four  dimensions 
might  be  represented  by  starting  with  a  two-dimensional  scatter  plot  and  adding  two 
additional dimensions as line segments at right angles, as in Figure 3.7. The “corner
point” represents $y_1$ and $y_2$, whereas $y_3$ and $y_4$ are given by the lengths of the two
line segments.

We  will  now  describe  various  methods  proposed  for  representing  p  dimensions  in 
a  plot  of  an  observation  vector,  where  p  >  2. 

Profiles  represent  each  point  by  p  vertical  bars,  with  the  heights  of  the  bars  depicting 
the  values  of  the  variables.  Sometimes  the  profile  is  outlined  by  a  polygonal 
line  rather  than  bars. 

Stars  portray  the  value  of  each  (normalized)  variable  as  a  point  along  a  ray  from  the 
center  to  the  outside  of  a  circle.  The  points  on  the  rays  are  usually  joined  to 
form  a  polygon. 

Glyphs  (Anderson  1960)  are  circles  of  fixed  size  with  rays  whose  lengths  represent 
the  values  of  the  variables.  Anderson  suggested  using  only  three  lengths  of 
rays,  thus  rounding  the  variable  values  to  three  levels. 

Faces  (Chernoff  1973)  depict  each  variable  as  a  feature  on  a  face,  such  as  length 
of  nose,  size  of  eyes,  shape  of  eyes,  and  so  on.  Flury  and  Riedwyl  (1981) 
suggested  using  asymmetric  faces,  thus  increasing  the  number  of  representable 
variables.

Boxes (Hartigan 1975) show each variable as the length of a dimension of a box.
For more than three variables, the dimensions are partitioned into segments.

Figure 3.7. Four-dimensional plot.

Among these five methods, Chambers and Kleiner (1982) prefer the star plots
because they “combine a reasonably distinctive appearance with computational
simplicity and ease of interpretation.” Commenting on the other methods, they state,
“Profiles are not so easy to compare as a general shape. Faces are memorable, but
they are more complex to draw, and one must be careful in assigning variables to
parameters and in choosing parameter ranges. Faces to some extent disguise the data
in the sense that individual data values may not be directly comparable from the
plot.”


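Table 3.2. Percentage of Republican Votes in Presidential Elections in Six Southern States
for Selected Years

| State          | 1932 | 1936 | 1940 | 1960 | 1964 | 1968 |
|----------------|------|------|------|------|------|------|
| Missouri       | 35   | 38   | 48   | 50   | 36   | 45   |
| Maryland       | 36   | 37   | 41   | 46   | 35   | 42   |
| Kentucky       | 40   | 40   | 42   | 54   | 36   | 44   |
| Louisiana      | 7    | 11   | 14   | 29   | 57   | 23   |
| Mississippi    | 4    | 3    | 4    | 25   | 87   | 14   |
| South Carolina | 2    | 1    | 4    | 49   | 59   | 39   |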

Example  3.4.  The  data  in  Table  3.2  are  from  Kleiner  and  Hartigan  (1981).  For  these
data,  the  preceding  five  graphical  devices  are  illustrated  in  Figure  3.8.  The  relative 
magnitudes  of  the  variables  can  be  compared  more  readily  using  stars  or  profiles  than 
faces.  □ 


3.5  MEAN  VECTORS 

It  is  a  common  practice  in  many  texts  to  use  an  uppercase  letter  for  a  variable  name 
and  the  corresponding  lowercase  letter  for  a  particular  value  or  observed  value  of 
the  random  variable,  for  example,  P(Y  >  y).  This  notation  is  convenient  in  some
univariate  contexts,  but  it  is  often  confusing  in  multivariate  analysis,  where  we  use 
uppercase  letters  for  matrices.  In  the  belief  that  it  is  easier  to  distinguish  between  a 
random  vector  and  an  observed  value  than  between  a  vector  and  a  matrix,  throughout 
this  text  we  follow  the  notation  established  in  Chapter  2.  Uppercase  boldface  letters 
are  used  for  matrices  of  random  variables  or  constants,  lowercase  boldface  letters 
represent  vectors  of  random  variables  or  constants,  and  univariate  random  variables 
or  constants  are  usually  represented  by  lowercase  nonbolded  letters. 

Figure 3.8. Profiles, stars, glyphs, faces, and boxes of percentage of Republican votes in
six presidential elections in six southern states. The radius of the circles in the stars is 50%.
Assignments of variables to facial features are 1932, shape of face; 1936, length of nose; 1940,
curvature of mouth; 1960, width of mouth; 1964, slant of eyes; and 1968, length of eyebrows.
(From the Journal of the American Statistical Association, 1981, p. 262.)

Let $\mathbf{y}$ represent a random vector of p variables measured on a sampling unit (subject
or object). If there are n individuals in the sample, the n observation vectors are
denoted by $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n$, where

$$ \mathbf{y}_i = \begin{pmatrix} y_{i1} \\ y_{i2} \\ \vdots \\ y_{ip} \end{pmatrix}. $$

The sample mean vector $\bar{\mathbf{y}}$ can be found either as the average of the n observation
vectors or by calculating the average of each of the p variables separately:

$$ \bar{\mathbf{y}} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{y}_i = \begin{pmatrix} \bar{y}_1 \\ \bar{y}_2 \\ \vdots \\ \bar{y}_p \end{pmatrix}, \tag{3.16} $$

where, for example, $\bar{y}_2 = \sum_{i=1}^{n} y_{i2}/n$. Thus $\bar{y}_1$ is the mean of the n observations on
the first variable, $\bar{y}_2$ is the mean of the second variable, and so on.

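All n observation vectors $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n$ can be transposed to row vectors and
listed in the data matrix $\mathbf{Y}$ as follows:

$$
\mathbf{Y} = \begin{pmatrix} \mathbf{y}_1' \\ \mathbf{y}_2' \\ \vdots \\ \mathbf{y}_i' \\ \vdots \\ \mathbf{y}_n' \end{pmatrix}
= \begin{pmatrix}
y_{11} & y_{12} & \cdots & y_{1j} & \cdots & y_{1p} \\
y_{21} & y_{22} & \cdots & y_{2j} & \cdots & y_{2p} \\
\vdots & \vdots & & \vdots & & \vdots \\
y_{i1} & y_{i2} & \cdots & y_{ij} & \cdots & y_{ip} \\
\vdots & \vdots & & \vdots & & \vdots \\
y_{n1} & y_{n2} & \cdots & y_{nj} & \cdots & y_{np}
\end{pmatrix}.
\tag{3.17}
$$

The rows correspond to units $1, 2, \ldots, i, \ldots, n$, and the columns correspond to
variables $1, 2, \ldots, j, \ldots, p$.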


Since n is usually greater than p, the data can be more conveniently tabulated by
entering the observation vectors as rows rather than columns. Note that the first
subscript i corresponds to units (subjects or objects) and the second subscript j refers to
variables. This convention will be followed whenever possible.

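In addition to the two ways of calculating $\bar{\mathbf{y}}$ given in (3.16), we can obtain $\bar{\mathbf{y}}$
from $\mathbf{Y}$. We sum the n entries in each column of $\mathbf{Y}$ and divide by n, which gives $\bar{\mathbf{y}}'$.
This can be indicated in matrix notation using (2.38),

$$ \bar{\mathbf{y}}' = \frac{1}{n}\mathbf{j}'\mathbf{Y}, \tag{3.18} $$

where $\mathbf{j}'$ is a vector of 1's, as defined in (2.11). For example, the second element of
$\mathbf{j}'\mathbf{Y}$ is

$$ (1, 1, \ldots, 1)\begin{pmatrix} y_{12} \\ y_{22} \\ \vdots \\ y_{n2} \end{pmatrix} = \sum_{i=1}^{n} y_{i2}. $$

We can transpose (3.18) to obtain

$$ \bar{\mathbf{y}} = \frac{1}{n}\mathbf{Y}'\mathbf{j}. \tag{3.19} $$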


We now turn to populations. The mean of $\mathbf{y}$ over all possible values in the population
is called the population mean vector or expected value of $\mathbf{y}$. It is defined as a
vector of expected values of each variable,

$$
E(\mathbf{y}) = E\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_p \end{pmatrix}
= \begin{pmatrix} E(y_1) \\ E(y_2) \\ \vdots \\ E(y_p) \end{pmatrix}
= \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_p \end{pmatrix} = \boldsymbol{\mu},
\tag{3.20}
$$

where $\mu_j$ is the population mean of the jth variable.

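It can be shown that the expected value of each $\bar{y}_j$ in $\bar{\mathbf{y}}$ is $\mu_j$, that is, $E(\bar{y}_j) = \mu_j$.
Thus the expected value of $\bar{\mathbf{y}}$ (over all possible samples) is

$$
E(\bar{\mathbf{y}}) = E\begin{pmatrix} \bar{y}_1 \\ \bar{y}_2 \\ \vdots \\ \bar{y}_p \end{pmatrix}
= \begin{pmatrix} E(\bar{y}_1) \\ E(\bar{y}_2) \\ \vdots \\ E(\bar{y}_p) \end{pmatrix}
= \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_p \end{pmatrix} = \boldsymbol{\mu}.
\tag{3.21}
$$

Therefore, $\bar{\mathbf{y}}$ is an unbiased estimator of $\boldsymbol{\mu}$. We emphasize again that $\bar{\mathbf{y}}$ is never equal
to $\boldsymbol{\mu}$.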


Example 3.5. Table 3.3 gives partial data from Kramer and Jensen (1969a). Three
variables were measured (in milliequivalents per 100 g) at 10 different locations in
the South. The variables are

$y_1$ = available soil calcium,
$y_2$ = exchangeable soil calcium,
$y_3$ = turnip green calcium.

To find the mean vector $\bar{\mathbf{y}}$, we simply calculate the average of each column and obtain

$$ \bar{\mathbf{y}}' = (28.1, 7.18, 3.089). $$

□


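Table 3.3. Calcium in Soil and Turnip Greens

| Location Number | $y_1$ | $y_2$ | $y_3$ |
|-----------------|-------|-------|-------|
| 1               | 35    | 3.5   | 2.80  |
| 2               | 35    | 4.9   | 2.70  |
| 3               | 40    | 30.0  | 4.38  |
| 4               | 10    | 2.8   | 3.21  |
| 5               | 6     | 2.7   | 2.73  |
| 6               | 20    | 2.8   | 2.81  |
| 7               | 35    | 4.6   | 2.88  |
| 8               | 35    | 10.9  | 2.90  |
| 9               | 35    | 8.0   | 3.28  |
| 10              | 30    | 1.6   | 3.20  |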
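The column averages quoted in Example 3.5 can be checked with a few lines of code. The sketch below stores Table 3.3 as a list of rows (one row per location) and computes the mean vector both by averaging each column and by the matrix form (3.19), with a list of ones standing in for the vector j. Variable names are illustrative.

```python
# Data matrix Y from Table 3.3: rows are locations, columns are (y1, y2, y3).
Y = [
    [35, 3.5, 2.80],
    [35, 4.9, 2.70],
    [40, 30.0, 4.38],
    [10, 2.8, 3.21],
    [6, 2.7, 2.73],
    [20, 2.8, 2.81],
    [35, 4.6, 2.88],
    [35, 10.9, 2.90],
    [35, 8.0, 3.28],
    [30, 1.6, 3.20],
]
n = len(Y)       # number of units
p = len(Y[0])    # number of variables

# Average each of the p columns separately, as in (3.16).
y_bar = [sum(row[k] for row in Y) / n for k in range(p)]

# Matrix form (3.19): y_bar = Y'j / n, where j is a vector of n ones.
j = [1] * n
y_bar_matrix = [sum(Y[i][k] * j[i] for i in range(n)) / n for k in range(p)]

print([round(v, 3) for v in y_bar])
```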
3.6  COVARIANCE  MATRICES

The sample covariance matrix $\mathbf{S} = (s_{jk})$ is the matrix of sample variances and
covariances of the p variables:

$$
\mathbf{S} = (s_{jk}) = \begin{pmatrix}
s_{11} & s_{12} & \cdots & s_{1p} \\
s_{21} & s_{22} & \cdots & s_{2p} \\
\vdots & \vdots & & \vdots \\
s_{p1} & s_{p2} & \cdots & s_{pp}
\end{pmatrix}.
\tag{3.22}
$$

In $\mathbf{S}$ the sample variances of the p variables are on the diagonal, and all possible
pairwise sample covariances appear off the diagonal. The jth row (column) contains
the covariances of $y_j$ with the other $p - 1$ variables.

We give three approaches to obtaining $\mathbf{S}$. The first of these is to simply calculate
the individual elements $s_{jk}$. The sample variance of the jth variable, $s_{jj} = s_j^2$, is
calculated as in (3.4) or (3.5), using the jth column of $\mathbf{Y}$:

$$ s_{jj} = s_j^2 = \frac{1}{n - 1}\sum_{i=1}^{n}(y_{ij} - \bar{y}_j)^2 \tag{3.23} $$

$$ \phantom{s_{jj} = s_j^2} = \frac{1}{n - 1}\left(\sum_{i=1}^{n} y_{ij}^2 - n\bar{y}_j^2\right), \tag{3.24} $$

where $\bar{y}_j$ is the mean of the jth variable, as in (3.16). The sample covariance of the
jth and kth variables, $s_{jk}$, is calculated as in (3.9) or (3.10), using the jth and kth
columns of $\mathbf{Y}$:

$$ s_{jk} = \frac{1}{n - 1}\sum_{i=1}^{n}(y_{ij} - \bar{y}_j)(y_{ik} - \bar{y}_k) \tag{3.25} $$

$$ \phantom{s_{jk}} = \frac{1}{n - 1}\left(\sum_{i=1}^{n} y_{ij}y_{ik} - n\bar{y}_j\bar{y}_k\right). \tag{3.26} $$

Note that in (3.23) the variance $s_{jj}$ is expressed as $s_j^2$, the square of the standard
deviation $s_j$, and that $\mathbf{S}$ is symmetric because $s_{jk} = s_{kj}$ in (3.25). Other names
used for the covariance matrix are variance matrix, variance-covariance matrix, and
dispersion matrix.

By way of notational clarification, we note that in the univariate case, the sample
variance is denoted by $s^2$. But in the multivariate case, we denote the sample
covariance matrix as $\mathbf{S}$, not as $\mathbf{S}^2$.

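The sample covariance matrix $\mathbf{S}$ can also be expressed in terms of the observation
vectors:

$$ \mathbf{S} = \frac{1}{n - 1}\sum_{i=1}^{n}(\mathbf{y}_i - \bar{\mathbf{y}})(\mathbf{y}_i - \bar{\mathbf{y}})' \tag{3.27} $$

$$ \phantom{\mathbf{S}} = \frac{1}{n - 1}\left(\sum_{i=1}^{n}\mathbf{y}_i\mathbf{y}_i' - n\bar{\mathbf{y}}\bar{\mathbf{y}}'\right). \tag{3.28} $$

Since $(\mathbf{y}_i - \bar{\mathbf{y}})' = (y_{i1} - \bar{y}_1, y_{i2} - \bar{y}_2, \ldots, y_{ip} - \bar{y}_p)$, the element in the (1, 1)
position of $(\mathbf{y}_i - \bar{\mathbf{y}})(\mathbf{y}_i - \bar{\mathbf{y}})'$ is $(y_{i1} - \bar{y}_1)^2$, and when this is summed over i as in
(3.27), the result is the numerator of $s_{11}$ in (3.23). Similarly, the (1, 2) element of
$(\mathbf{y}_i - \bar{\mathbf{y}})(\mathbf{y}_i - \bar{\mathbf{y}})'$ is $(y_{i1} - \bar{y}_1)(y_{i2} - \bar{y}_2)$, which sums to the numerator of $s_{12}$ in
(3.25). Thus (3.27) is equivalent to (3.23) and (3.25), and likewise (3.28) produces
(3.24) and (3.26).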

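We can also obtain $\mathbf{S}$ directly from the data matrix $\mathbf{Y}$ in (3.17), which provides
a third approach. The first term in the right side of (3.26), $\sum_i y_{ij}y_{ik}$, is the product
of the jth and kth columns of $\mathbf{Y}$, whereas the second term, $n\bar{y}_j\bar{y}_k$, is the (jk)th
element of $n\bar{\mathbf{y}}\bar{\mathbf{y}}'$. It was noted in (2.54) that $\mathbf{Y}'\mathbf{Y}$ is obtained as products of columns
of $\mathbf{Y}$. By (3.18) and (3.19), $\bar{\mathbf{y}} = \mathbf{Y}'\mathbf{j}/n$ and $\bar{\mathbf{y}}' = \mathbf{j}'\mathbf{Y}/n$; and using (2.36), we have
$n\bar{\mathbf{y}}\bar{\mathbf{y}}' = \mathbf{Y}'(\mathbf{J}/n)\mathbf{Y}$. Thus $\mathbf{S}$ can be written as

$$
\mathbf{S} = \frac{1}{n - 1}\left[\mathbf{Y}'\mathbf{Y} - \mathbf{Y}'\left(\frac{1}{n}\mathbf{J}\right)\mathbf{Y}\right]
= \frac{1}{n - 1}\mathbf{Y}'\left(\mathbf{I} - \frac{1}{n}\mathbf{J}\right)\mathbf{Y} \quad [\text{by (2.30)}].
\tag{3.29}
$$

Expression (3.29) is a convenient representation of $\mathbf{S}$, since it makes direct use of
the data matrix $\mathbf{Y}$. However, the matrix $\mathbf{I} - \mathbf{J}/n$ is $n \times n$ and may be unwieldy in
computation if n is large.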
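To see that the three approaches give the same matrix, the sketch below applies (3.25), (3.27), and (3.29) to a small data matrix. The 4 x 2 matrix `Y` is hypothetical, and `matmul` and `transpose` are plain-Python helpers written for this illustration rather than calls to any library.

```python
# Hypothetical 4 x 2 data matrix (rows = units, columns = variables).
Y = [[2.0, 1.0], [4.0, 3.0], [6.0, 2.0], [8.0, 6.0]]
n, p = len(Y), len(Y[0])
y_bar = [sum(row[j] for row in Y) / n for j in range(p)]


def transpose(A):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    """Matrix product AB for matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


# Approach 1: element by element, as in (3.25).
S_elem = [[sum((row[j] - y_bar[j]) * (row[k] - y_bar[k]) for row in Y) / (n - 1)
           for k in range(p)] for j in range(p)]

# Approach 2: sum of outer products of centered observation vectors, (3.27).
S_outer = [[0.0] * p for _ in range(p)]
for row in Y:
    d = [row[j] - y_bar[j] for j in range(p)]
    for j in range(p):
        for k in range(p):
            S_outer[j][k] += d[j] * d[k] / (n - 1)

# Approach 3: centering matrix I - J/n applied to Y, (3.29).
C = [[(1.0 if i == k else 0.0) - 1.0 / n for k in range(n)] for i in range(n)]
S_center = [[v / (n - 1) for v in row]
            for row in matmul(matmul(transpose(Y), C), Y)]

print(S_elem)
print(S_outer)
print(S_center)
```

The third approach builds the full n x n centering matrix, which illustrates the remark above about its size; for large n one would center the columns directly instead.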

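If $\mathbf{y}$ is a random vector taking on any possible value in a multivariate population,
the population covariance matrix is defined as

$$
\boldsymbol{\Sigma} = \mathrm{cov}(\mathbf{y}) = \begin{pmatrix}
\sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\
\sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\
\vdots & \vdots & & \vdots \\
\sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp}
\end{pmatrix}.
\tag{3.30}
$$

The diagonal elements $\sigma_{jj} = \sigma_j^2$ are the population variances of the y's, and the
off-diagonal elements $\sigma_{jk}$ are the population covariances of all possible pairs of y's.

The notation $\boldsymbol{\Sigma}$ for the covariance matrix is widely used and seems natural because
$\boldsymbol{\Sigma}$ is the uppercase version of $\sigma$. It should not be confused with the same symbol
used for summation of a series. The difference should always be apparent from the
context. To help further distinguish the two uses, the covariance matrix $\boldsymbol{\Sigma}$ will differ
in typeface and in size from the summation symbol $\sum$. Also, whenever they appear
together, the summation symbol will have an index of summation, such as $\sum_{i=1}^{n}$.

The population covariance matrix in (3.30) can also be found as

$$ \boldsymbol{\Sigma} = E[(\mathbf{y} - \boldsymbol{\mu})(\mathbf{y} - \boldsymbol{\mu})'], \tag{3.31} $$

which is analogous to (3.27) for the sample covariance matrix. The $p \times p$ matrix
$(\mathbf{y} - \boldsymbol{\mu})(\mathbf{y} - \boldsymbol{\mu})'$ is a random matrix. The expected value of a random matrix is
defined as the matrix of expected values of the corresponding elements. To see that
(3.31) produces population variances and covariances of the p variables as in (3.30),
note that

$$
\begin{aligned}
\boldsymbol{\Sigma} &= E[(\mathbf{y} - \boldsymbol{\mu})(\mathbf{y} - \boldsymbol{\mu})']
= E\left[\begin{pmatrix} y_1 - \mu_1 \\ y_2 - \mu_2 \\ \vdots \\ y_p - \mu_p \end{pmatrix}
(y_1 - \mu_1,\; y_2 - \mu_2,\; \ldots,\; y_p - \mu_p)\right] \\
&= E\begin{pmatrix}
(y_1 - \mu_1)^2 & (y_1 - \mu_1)(y_2 - \mu_2) & \cdots & (y_1 - \mu_1)(y_p - \mu_p) \\
(y_2 - \mu_2)(y_1 - \mu_1) & (y_2 - \mu_2)^2 & \cdots & (y_2 - \mu_2)(y_p - \mu_p) \\
\vdots & \vdots & & \vdots \\
(y_p - \mu_p)(y_1 - \mu_1) & (y_p - \mu_p)(y_2 - \mu_2) & \cdots & (y_p - \mu_p)^2
\end{pmatrix} \\
&= \begin{pmatrix}
E(y_1 - \mu_1)^2 & E(y_1 - \mu_1)(y_2 - \mu_2) & \cdots & E(y_1 - \mu_1)(y_p - \mu_p) \\
E(y_2 - \mu_2)(y_1 - \mu_1) & E(y_2 - \mu_2)^2 & \cdots & E(y_2 - \mu_2)(y_p - \mu_p) \\
\vdots & \vdots & & \vdots \\
E(y_p - \mu_p)(y_1 - \mu_1) & E(y_p - \mu_p)(y_2 - \mu_2) & \cdots & E(y_p - \mu_p)^2
\end{pmatrix} \\
&= \begin{pmatrix}
\sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\
\sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\
\vdots & \vdots & & \vdots \\
\sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp}
\end{pmatrix}.
\end{aligned}
$$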

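It can be easily shown that $\boldsymbol{\Sigma}$ can be expressed in a form analogous to (3.28):

$$ \boldsymbol{\Sigma} = E(\mathbf{y}\mathbf{y}') - \boldsymbol{\mu}\boldsymbol{\mu}'. \tag{3.32} $$

Since $E(s_{jk}) = \sigma_{jk}$ for all j, k, the sample covariance matrix $\mathbf{S}$ is an unbiased
estimator for $\boldsymbol{\Sigma}$:

$$ E(\mathbf{S}) = \boldsymbol{\Sigma}. \tag{3.33} $$

As in the univariate case, we note that it is the average of all possible values of $\mathbf{S}$ that
is equal to $\boldsymbol{\Sigma}$. Generally, $\mathbf{S}$ will never be equal to $\boldsymbol{\Sigma}$.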
Example 3.6. To calculate the sample covariance matrix for the calcium data of
Table 3.3 using the computational forms (3.24) and (3.26), we need the sum of
squares of each column and the sum of products of each pair of columns. We illustrate
the computation of $s_{13}$.

$$ \sum_{i=1}^{10} y_{i1}y_{i3} = (35)(2.80) + (35)(2.70) + \cdots + (30)(3.20) = 885.48. $$

From Example 3.5 we have $\bar{y}_1 = 28.1$ and $\bar{y}_3 = 3.089$. By (3.26), we obtain

$$ s_{13} = \frac{1}{9}[885.48 - 10(28.1)(3.089)] = \frac{17.471}{9} = 1.9412. $$

Continuing in this fashion, we obtain

$$
\mathbf{S} = \begin{pmatrix}
140.54 & 49.68 & 1.94 \\
49.68 & 72.25 & 3.68 \\
1.94 & 3.68 & .25
\end{pmatrix}.
$$

□
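The full matrix in Example 3.6 can be reproduced from Table 3.3 with the computational form (3.26). This is a sketch with illustrative variable names; it uses only the data given in the table.

```python
# Data matrix Y from Table 3.3: rows are locations, columns are (y1, y2, y3).
Y = [
    [35, 3.5, 2.80],
    [35, 4.9, 2.70],
    [40, 30.0, 4.38],
    [10, 2.8, 3.21],
    [6, 2.7, 2.73],
    [20, 2.8, 2.81],
    [35, 4.6, 2.88],
    [35, 10.9, 2.90],
    [35, 8.0, 3.28],
    [30, 1.6, 3.20],
]
n, p = len(Y), len(Y[0])
y_bar = [sum(row[j] for row in Y) / n for j in range(p)]

# Sum of products of columns 1 and 3, as in the illustration of s_13.
sum_y1_y3 = sum(row[0] * row[2] for row in Y)

# Computational form (3.26); the diagonal entries are (3.24) with j = k.
S = [[(sum(row[j] * row[k] for row in Y) - n * y_bar[j] * y_bar[k]) / (n - 1)
      for k in range(p)] for j in range(p)]

print(round(sum_y1_y3, 2), round(S[0][2], 4))
for row in S:
    print([round(v, 2) for v in row])
```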

3.7  CORRELATION  MATRICES 


The sample correlation between the jth and kth variables is defined in (3.13) as

$$ r_{jk} = \frac{s_{jk}}{\sqrt{s_{jj}s_{kk}}} = \frac{s_{jk}}{s_j s_k}. \tag{3.34} $$

The sample correlation matrix is analogous to the covariance matrix with correlations
in place of covariances:

$$
\mathbf{R} = (r_{jk}) = \begin{pmatrix}
1 & r_{12} & \cdots & r_{1p} \\
r_{21} & 1 & \cdots & r_{2p} \\
\vdots & \vdots & & \vdots \\
r_{p1} & r_{p2} & \cdots & 1
\end{pmatrix}.
\tag{3.35}
$$

The second row, for example, contains the correlation of $y_2$ with each of the y's
(including the correlation of $y_2$ with itself, which is 1). Of course, the matrix $\mathbf{R}$ is
symmetric, since $r_{jk} = r_{kj}$.

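The correlation matrix can be obtained from the covariance matrix, and vice versa.
Define

$$
\mathbf{D}_s = \mathrm{diag}(\sqrt{s_{11}}, \sqrt{s_{22}}, \ldots, \sqrt{s_{pp}})
= \mathrm{diag}(s_1, s_2, \ldots, s_p)
= \begin{pmatrix}
s_1 & 0 & \cdots & 0 \\
0 & s_2 & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & s_p
\end{pmatrix}.
\tag{3.36}
$$

Then by (2.57)

$$ \mathbf{R} = \mathbf{D}_s^{-1}\mathbf{S}\mathbf{D}_s^{-1}, \tag{3.37} $$

$$ \mathbf{S} = \mathbf{D}_s\mathbf{R}\mathbf{D}_s. \tag{3.38} $$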

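The population correlation matrix analogous to (3.35) is defined as

$$
\mathbf{P}_\rho = (\rho_{jk}) = \begin{pmatrix}
1 & \rho_{12} & \cdots & \rho_{1p} \\
\rho_{21} & 1 & \cdots & \rho_{2p} \\
\vdots & \vdots & & \vdots \\
\rho_{p1} & \rho_{p2} & \cdots & 1
\end{pmatrix},
$$

where

$$ \rho_{jk} = \frac{\sigma_{jk}}{\sigma_j\sigma_k}, \tag{3.39} $$

as in (3.12).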

Example 3.7. In Example 3.6, we obtained the sample covariance matrix $\mathbf{S}$ for the
calcium data in Table 3.3. To obtain the sample correlation matrix for the same data,
we can calculate the individual elements using (3.34) or use the direct matrix operation
in (3.37). The diagonal matrix $\mathbf{D}_s$ can be found by taking the square roots of the
diagonal elements of $\mathbf{S}$,

$$
\mathbf{D}_s = \begin{pmatrix}
11.8551 & 0 & 0 \\
0 & 8.4999 & 0 \\
0 & 0 & .5001
\end{pmatrix}
$$

(note that we have used the unrounded version of $\mathbf{S}$ for computation). Then by (3.37),

$$
\mathbf{R} = \mathbf{D}_s^{-1}\mathbf{S}\mathbf{D}_s^{-1} = \begin{pmatrix}
1.000 & .493 & .327 \\
.493 & 1.000 & .865 \\
.327 & .865 & 1.000
\end{pmatrix}.
$$

Note that .865 > .493 > .327, which is a different order than that of the covariances
in $\mathbf{S}$ in Example 3.6. Thus we cannot compare covariances, even within the same
matrix $\mathbf{S}$. □
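The matrix operations (3.37) and (3.38) can be sketched for the calcium data as follows. The code recomputes the unrounded S from Table 3.3, forms the inverse of the diagonal matrix of standard deviations, and then recovers S from R as a check. The helper `matmul` is a plain-Python stand-in written for this illustration.

```python
import math

# Data matrix Y from Table 3.3: rows are locations, columns are (y1, y2, y3).
Y = [
    [35, 3.5, 2.80], [35, 4.9, 2.70], [40, 30.0, 4.38], [10, 2.8, 3.21],
    [6, 2.7, 2.73], [20, 2.8, 2.81], [35, 4.6, 2.88], [35, 10.9, 2.90],
    [35, 8.0, 3.28], [30, 1.6, 3.20],
]
n, p = len(Y), len(Y[0])
y_bar = [sum(row[j] for row in Y) / n for j in range(p)]

# Unrounded sample covariance matrix, as in (3.25).
S = [[sum((row[j] - y_bar[j]) * (row[k] - y_bar[k]) for row in Y) / (n - 1)
      for k in range(p)] for j in range(p)]


def matmul(A, B):
    """Matrix product AB for matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


# Standard deviations and the diagonal matrices D_s and D_s^{-1}, as in (3.36).
s = [math.sqrt(S[j][j]) for j in range(p)]
D = [[s[j] if j == k else 0.0 for k in range(p)] for j in range(p)]
D_inv = [[1.0 / s[j] if j == k else 0.0 for k in range(p)] for j in range(p)]

R = matmul(matmul(D_inv, S), D_inv)   # (3.37)
S_back = matmul(matmul(D, R), D)      # (3.38)

print([round(v, 4) for v in s])
for row in R:
    print([round(v, 3) for v in row])
```

Because the inverse of a diagonal matrix is obtained by inverting its diagonal entries, (3.37) amounts to dividing each covariance by the two corresponding standard deviations, which is (3.34).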
3.8  MEAN  VECTORS  AND  COVARIANCE  MATRICES
FOR  SUBSETS  OF  VARIABLES 


3.8.1  Two  Subsets 

Sometimes a researcher is interested in two different kinds of variables, both measured
on the same sampling unit. This corresponds to type 2 data in Section 1.4. For
example,  several  classroom  behaviors  are  observed  for  students,  and  during  the  same 
time  period  (the  basic  experimental  unit)  several  teacher  behaviors  are  also  observed. 
The  researcher  wishes  to  study  the  relationships  between  the  pupil  variables  and  the 
teacher  variables. 

We will denote the two subvectors by $\mathbf{y}$ and $\mathbf{x}$, with p variables in $\mathbf{y}$ and q variables
in $\mathbf{x}$. Thus each observation vector in a sample is partitioned as

$$
\begin{pmatrix} \mathbf{y}_i \\ \mathbf{x}_i \end{pmatrix}
= \begin{pmatrix} y_{i1} \\ \vdots \\ y_{ip} \\ x_{i1} \\ \vdots \\ x_{iq} \end{pmatrix},
\quad i = 1, 2, \ldots, n.
\tag{3.40}
$$

Hence there are $p + q$ variables in each of n observation vectors. In Chapter 10
we will discuss regression of the y's on the x's, and in Chapter 11 we will define a
measure of correlation between the y's and the x's.

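For the sample of n observation vectors, the mean vector and covariance matrix
have the form

$$
\begin{pmatrix} \bar{\mathbf{y}} \\ \bar{\mathbf{x}} \end{pmatrix}
= \begin{pmatrix} \bar{y}_1 \\ \vdots \\ \bar{y}_p \\ \bar{x}_1 \\ \vdots \\ \bar{x}_q \end{pmatrix},
\tag{3.41}
$$

$$
\mathbf{S} = \begin{pmatrix} \mathbf{S}_{yy} & \mathbf{S}_{yx} \\ \mathbf{S}_{xy} & \mathbf{S}_{xx} \end{pmatrix},
\tag{3.42}
$$

where $\mathbf{S}_{yy}$ is $p \times p$, $\mathbf{S}_{yx}$ is $p \times q$, $\mathbf{S}_{xy}$ is $q \times p$, and $\mathbf{S}_{xx}$ is $q \times q$. Note that because
of the symmetry of $\mathbf{S}$,

$$ \mathbf{S}_{xy} = \mathbf{S}_{yx}'. \tag{3.43} $$

Thus (3.42) could be written

$$
\mathbf{S} = \begin{pmatrix} \mathbf{S}_{yy} & \mathbf{S}_{yx} \\ \mathbf{S}_{yx}' & \mathbf{S}_{xx} \end{pmatrix}.
\tag{3.44}
$$

To illustrate (3.41) and (3.42), let $p = 2$ and $q = 3$. Then

$$
\begin{pmatrix} \bar{\mathbf{y}} \\ \bar{\mathbf{x}} \end{pmatrix}
= \begin{pmatrix} \bar{y}_1 \\ \bar{y}_2 \\ \bar{x}_1 \\ \bar{x}_2 \\ \bar{x}_3 \end{pmatrix},
$$

$$
\mathbf{S} = \begin{pmatrix} \mathbf{S}_{yy} & \mathbf{S}_{yx} \\ \mathbf{S}_{xy} & \mathbf{S}_{xx} \end{pmatrix}
= \left(\begin{array}{cc|ccc}
s_{y_1}^2 & s_{y_1y_2} & s_{y_1x_1} & s_{y_1x_2} & s_{y_1x_3} \\
s_{y_2y_1} & s_{y_2}^2 & s_{y_2x_1} & s_{y_2x_2} & s_{y_2x_3} \\
\hline
s_{x_1y_1} & s_{x_1y_2} & s_{x_1}^2 & s_{x_1x_2} & s_{x_1x_3} \\
s_{x_2y_1} & s_{x_2y_2} & s_{x_2x_1} & s_{x_2}^2 & s_{x_2x_3} \\
s_{x_3y_1} & s_{x_3y_2} & s_{x_3x_1} & s_{x_3x_2} & s_{x_3}^2
\end{array}\right).
$$

The pattern in each of $\mathbf{S}_{yy}$, $\mathbf{S}_{yx}$, $\mathbf{S}_{xy}$, and $\mathbf{S}_{xx}$ is clearly seen in this illustration. For
example, the first row of $\mathbf{S}_{yx}$ has the covariance of $y_1$ with each of $x_1$, $x_2$, $x_3$; the
second row exhibits covariances of $y_2$ with the three x's. On the other hand, $\mathbf{S}_{xy}$ has
as its first row the covariances of $x_1$ with $y_1$ and $y_2$, and so on. Thus $\mathbf{S}_{xy} = \mathbf{S}_{yx}'$, as
noted in (3.43).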

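The analogous population results for a partitioned random vector are

$$
E\begin{pmatrix} \mathbf{y} \\ \mathbf{x} \end{pmatrix}
= \begin{pmatrix} E(\mathbf{y}) \\ E(\mathbf{x}) \end{pmatrix}
= \begin{pmatrix} \boldsymbol{\mu}_y \\ \boldsymbol{\mu}_x \end{pmatrix},
\tag{3.45}
$$

$$
\mathrm{cov}\begin{pmatrix} \mathbf{y} \\ \mathbf{x} \end{pmatrix}
= \boldsymbol{\Sigma}
= \begin{pmatrix} \boldsymbol{\Sigma}_{yy} & \boldsymbol{\Sigma}_{yx} \\ \boldsymbol{\Sigma}_{xy} & \boldsymbol{\Sigma}_{xx} \end{pmatrix},
\tag{3.46}
$$

where $\boldsymbol{\Sigma}_{xy} = \boldsymbol{\Sigma}_{yx}'$. The submatrix $\boldsymbol{\Sigma}_{yy}$ is a $p \times p$ covariance matrix containing the
variances of $y_1, y_2, \ldots, y_p$ on the diagonal and the covariance of each $y_j$ with each
$y_k$ off the diagonal. Similarly, $\boldsymbol{\Sigma}_{xx}$ is the $q \times q$ covariance matrix of $x_1, x_2, \ldots, x_q$.
The matrix $\boldsymbol{\Sigma}_{yx}$ is $p \times q$ and contains the covariance of each $y_j$ with each $x_k$. The
covariance matrix $\boldsymbol{\Sigma}_{yx}$ is also denoted by $\mathrm{cov}(\mathbf{y}, \mathbf{x})$, that is,

$$ \mathrm{cov}(\mathbf{y}, \mathbf{x}) = \boldsymbol{\Sigma}_{yx}. \tag{3.47} $$

Note the difference in meaning between $\mathrm{cov}\begin{pmatrix} \mathbf{y} \\ \mathbf{x} \end{pmatrix} = \boldsymbol{\Sigma}$ in (3.46) and $\mathrm{cov}(\mathbf{y}, \mathbf{x}) = \boldsymbol{\Sigma}_{yx}$
in (3.47); $\mathrm{cov}\begin{pmatrix} \mathbf{y} \\ \mathbf{x} \end{pmatrix}$ involves a single vector containing $p + q$ variables, and $\mathrm{cov}(\mathbf{y}, \mathbf{x})$
involves two vectors.

If $\mathbf{x}$ and $\mathbf{y}$ are independent, then $\boldsymbol{\Sigma}_{yx} = \mathbf{O}$. This means that each $y_j$ is uncorrelated
with each $x_k$ so that $\sigma_{y_jx_k} = 0$ for $j = 1, 2, \ldots, p$; $k = 1, 2, \ldots, q$.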
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Example  3.8.1.  Reaven  and  Miller  (1979;  see  also  Andrews  and  Herzberg  1985, 
pp.  215-219)  measured  five  variables  in  a  comparison  of  normal  patients  and  diabet¬ 
ics.  In  Table  3.4  we  give  partial  data  for  normal  patients  only.  The  three  variables  of 
major  interest  were 


xi  —  glucose  intolerance, 

xo  =  insulin  response  to  oral  glucose, 

*3  =  insulin  resistance. 

The  two  additional  variables  of  minor  interest  were 


}’i  =  relative  weight, 

y2  —  fasting  plasma  glucose. 


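The mean vector, partitioned as in (3.41), is

$$\begin{pmatrix} \bar{\mathbf{y}} \\ \bar{\mathbf{x}} \end{pmatrix} = \begin{pmatrix} \bar{y}_1 \\ \bar{y}_2 \\ \bar{x}_1 \\ \bar{x}_2 \\ \bar{x}_3 \end{pmatrix} = \begin{pmatrix} .918 \\ 90.41 \\ 340.83 \\ 171.37 \\ 97.78 \end{pmatrix}.$$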

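The covariance matrix, partitioned as in the illustration following (3.44), is

$$\mathbf{S} = \begin{pmatrix} \mathbf{S}_{yy} & \mathbf{S}_{yx} \\ \mathbf{S}_{xy} & \mathbf{S}_{xx} \end{pmatrix} = \left(\begin{array}{rr|rrr}
.0162 & .2160 & .7872 & -.2138 & 2.189 \\
.2160 & 70.56 & 26.23 & -23.96 & -20.84 \\ \hline
.7872 & 26.23 & 1106 & 396.7 & 108.4 \\
-.2138 & -23.96 & 396.7 & 2382 & 1143 \\
2.189 & -20.84 & 108.4 & 1143 & 2136
\end{array}\right).$$

Notice that $\mathbf{S}_{yy}$ and $\mathbf{S}_{xx}$ are symmetric and that $\mathbf{S}_{xy}$ is the transpose of $\mathbf{S}_{yx}$. □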
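The partitioning used in this example can be sketched numerically. The block below is a minimal illustration, not the computation behind the values quoted above: it uses synthetic data (an assumption), and the names `data`, `S_yy`, `S_yx`, `S_xy`, and `S_xx` are introduced here only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 30, 2, 3                      # hypothetical sizes: p y-variables, q x-variables
data = rng.normal(size=(n, p + q))      # synthetic rows laid out as (y1, y2, x1, x2, x3)

mean = data.mean(axis=0)                # full mean vector of length p + q
S = np.cov(data, rowvar=False)          # (p+q) x (p+q) sample covariance, divisor n - 1

# Partition the mean vector and the covariance matrix into y and x parts.
ybar, xbar = mean[:p], mean[p:]
S_yy, S_yx = S[:p, :p], S[:p, p:]
S_xy, S_xx = S[p:, :p], S[p:, p:]

print(S_yx.shape, S_xy.shape)           # S_yx is p x q, S_xy is q x p
print(np.allclose(S_xy, S_yx.T))        # S_xy is the transpose of S_yx
```

Slicing one covariance matrix of the stacked vector, rather than computing four matrices separately, keeps the blocks consistent with each other by construction.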


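Table 3.4. Relative Weight, Blood Glucose, and Insulin Levels

Patient
Number      y1     y2     x1     x2     x3
   1       .81     80    356    124     55
   2       .95     97    289    117     76
   3       .94    105    319    143    105
   4      1.04     90    356    199    108
   5      1.00     90    323    240    143
   6       .76     86    381    157    165
   7       .91    100    350    221    119
   8      1.10     85    301    186    105
   9       .99     97    379    142     98
  10       .78     97    296    131     94
  11       .90     91    353    221     53
  12       .73     87    306    178     66
  13       .96     78    290    136    142
  14       .84     90    371    200     93
  15       .74     86    312    208     68
  16       .98     80    393    202    102
  17      1.10     90    364    152     76
  18       .85     99    359    185     37
  19       .83     85    296    116     60
  20       .93     90    345    123     50
  21       .95     90    378    136     47
  22       .74     88    304    134     50
  23       .95     95    347    184     91
  24       .97     90    327    192    124
  25       .72     92    386    279     74
  26      1.11     74    365    228    235
  27      1.20     98    365    145    158
  28      1.13    100    352    172    140
  29      1.00     86    325    179    145
  30       .78     98    321    222     99
  31      1.00     70    360    134     90
  32      1.00     99    336    143    105
  33       .71     75    352    169     32
  34       .76     90    353    263    165
  35       .89     85    373    174     78
  36       .88     99    376    134     80
  37      1.17    100    367    182     54
  38       .85     78    335    241    175
  39       .97    106    396    128     80
  40      1.00     98    277    222    186
  41      1.00    102    378    165    117
  42       .89     90    360    282    160
  43       .98     94    291     94     71
  44       .78     80    269    121     29
  45       .74     93    318     73     42
  46       .91     86    328    106     56


3.8.2  Three or More Subsets

In some cases, three or more subsets of variables are of interest. If the observation
vector $\mathbf{y}$ is partitioned as

$$\mathbf{y} = \begin{pmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \\ \vdots \\ \mathbf{y}_k \end{pmatrix},$$

where $\mathbf{y}_1$ has $p_1$ variables, $\mathbf{y}_2$ has $p_2$, ..., $\mathbf{y}_k$ has $p_k$, with $p = p_1 + p_2 + \cdots + p_k$,
then the sample mean vector and covariance matrix are given by

$$\bar{\mathbf{y}} = \begin{pmatrix} \bar{\mathbf{y}}_1 \\ \bar{\mathbf{y}}_2 \\ \vdots \\ \bar{\mathbf{y}}_k \end{pmatrix}, \tag{3.48}$$

$$\mathbf{S} = \begin{pmatrix} \mathbf{S}_{11} & \mathbf{S}_{12} & \cdots & \mathbf{S}_{1k} \\ \mathbf{S}_{21} & \mathbf{S}_{22} & \cdots & \mathbf{S}_{2k} \\ \vdots & \vdots & & \vdots \\ \mathbf{S}_{k1} & \mathbf{S}_{k2} & \cdots & \mathbf{S}_{kk} \end{pmatrix}. \tag{3.49}$$

The $p_2 \times p_k$ submatrix $\mathbf{S}_{2k}$, for example, contains the covariances of the variables in
$\mathbf{y}_2$ with the variables in $\mathbf{y}_k$.

The corresponding population results are

$$\boldsymbol{\mu} = \begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \\ \vdots \\ \boldsymbol{\mu}_k \end{pmatrix}, \tag{3.50}$$

$$\boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} & \cdots & \boldsymbol{\Sigma}_{1k} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} & \cdots & \boldsymbol{\Sigma}_{2k} \\ \vdots & \vdots & & \vdots \\ \boldsymbol{\Sigma}_{k1} & \boldsymbol{\Sigma}_{k2} & \cdots & \boldsymbol{\Sigma}_{kk} \end{pmatrix}. \tag{3.51}$$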


3.9  LINEAR  COMBINATIONS  OF  VARIABLES 
3.9.1  Sample  Properties 

We are frequently interested in linear combinations of the variables $y_1, y_2, \ldots, y_p$.
For example, two of the types of linear functions we use in later chapters are (1) linear
combinations that maximize some function and (2) linear combinations that compare
variables, for example, $y_1 - y_3$. In this section, we investigate the means, variances,
and covariances of linear combinations.

Let $a_1, a_2, \ldots, a_p$ be constants and consider the linear combination of the elements
of the vector $\mathbf{y}$,

$$z = a_1 y_1 + a_2 y_2 + \cdots + a_p y_p = \mathbf{a}'\mathbf{y}, \tag{3.52}$$

where $\mathbf{a}' = (a_1, a_2, \ldots, a_p)$. If the same coefficient vector $\mathbf{a}$ is applied to each $\mathbf{y}_i$ in
a sample, we have

$$z_i = a_1 y_{i1} + a_2 y_{i2} + \cdots + a_p y_{ip} = \mathbf{a}'\mathbf{y}_i, \quad i = 1, 2, \ldots, n. \tag{3.53}$$

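The sample mean of $z$ can be found either by averaging the $n$ values $z_1 = \mathbf{a}'\mathbf{y}_1$, $z_2 =
\mathbf{a}'\mathbf{y}_2$, ..., $z_n = \mathbf{a}'\mathbf{y}_n$ or as a linear combination of $\bar{\mathbf{y}}$, the sample mean vector of
$\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n$:

$$\bar{z} = \frac{1}{n}\sum_{i=1}^{n} z_i = \mathbf{a}'\bar{\mathbf{y}}. \tag{3.54}$$

The result in (3.54) is analogous to the univariate result (3.3), $\bar{z} = a\bar{y}$, where $z_i =
a y_i$, $i = 1, 2, \ldots, n$.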

Similarly, the sample variance of $z_i = \mathbf{a}'\mathbf{y}_i$, $i = 1, 2, \ldots, n$, can be found as the
sample variance of $z_1, z_2, \ldots, z_n$ or directly from $\mathbf{a}$ and $\mathbf{S}$, where $\mathbf{S}$ is the sample
covariance matrix of $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n$:

$$s_z^2 = \frac{\sum_{i=1}^{n}(z_i - \bar{z})^2}{n - 1} = \mathbf{a}'\mathbf{S}\mathbf{a}. \tag{3.55}$$

Note that $s_z^2 = \mathbf{a}'\mathbf{S}\mathbf{a}$ is the multivariate analogue of the univariate result in (3.6),
$s_z^2 = a^2 s^2$, where $z_i = a y_i$, $i = 1, 2, \ldots, n$, and $s^2$ is the variance of $y_1, y_2, \ldots, y_n$.

Since a variance is always nonnegative, we have $s_z^2 \ge 0$, and therefore $\mathbf{a}'\mathbf{S}\mathbf{a} \ge 0$,
for every $\mathbf{a}$. Hence $\mathbf{S}$ is at least positive semidefinite (see Section 2.7). If the variables
are continuous and are not linearly related, and if $n - 1 > p$ (so that $\mathbf{S}$ is full rank),
then $\mathbf{S}$ is positive definite (with probability 1).

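If we define another linear combination $w = \mathbf{b}'\mathbf{y} = b_1 y_1 + b_2 y_2 + \cdots + b_p y_p$,
where $\mathbf{b}' = (b_1, b_2, \ldots, b_p)$ is a vector of constants different from $\mathbf{a}'$, then the
sample covariance of $z$ and $w$ is given by

$$s_{zw} = \frac{\sum_{i=1}^{n}(z_i - \bar{z})(w_i - \bar{w})}{n - 1} = \mathbf{a}'\mathbf{S}\mathbf{b}. \tag{3.56}$$

The sample correlation between $z$ and $w$ is readily obtained as

$$r_{zw} = \frac{s_{zw}}{\sqrt{s_z^2 s_w^2}} = \frac{\mathbf{a}'\mathbf{S}\mathbf{b}}{\sqrt{(\mathbf{a}'\mathbf{S}\mathbf{a})(\mathbf{b}'\mathbf{S}\mathbf{b})}}. \tag{3.57}$$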


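We now denote the two constant vectors $\mathbf{a}$ and $\mathbf{b}$ as $\mathbf{a}_1$ and $\mathbf{a}_2$ to facilitate later
expansion to more than two such vectors. Let

$$\mathbf{A} = \begin{pmatrix} \mathbf{a}_1' \\ \mathbf{a}_2' \end{pmatrix}$$

and define

$$\mathbf{z} = \begin{pmatrix} z_1 \\ z_2 \end{pmatrix} = \begin{pmatrix} \mathbf{a}_1'\mathbf{y} \\ \mathbf{a}_2'\mathbf{y} \end{pmatrix}.$$

Then we can factor $\mathbf{y}$ from this expression by (2.49):

$$\mathbf{z} = \begin{pmatrix} \mathbf{a}_1' \\ \mathbf{a}_2' \end{pmatrix}\mathbf{y} = \mathbf{A}\mathbf{y}.$$

If we evaluate the bivariate $\mathbf{z}_i$ for each $p$-variate $\mathbf{y}_i$ in the sample, we obtain
$\mathbf{z}_i = \mathbf{A}\mathbf{y}_i$, $i = 1, 2, \ldots, n$, and the average of $\mathbf{z}$ over the sample can be found from $\bar{\mathbf{y}}$:

$$\bar{\mathbf{z}} = \begin{pmatrix} \bar{z}_1 \\ \bar{z}_2 \end{pmatrix} = \begin{pmatrix} \mathbf{a}_1'\bar{\mathbf{y}} \\ \mathbf{a}_2'\bar{\mathbf{y}} \end{pmatrix} \quad [\text{by (3.54)}] \tag{3.58}$$

$$= \begin{pmatrix} \mathbf{a}_1' \\ \mathbf{a}_2' \end{pmatrix}\bar{\mathbf{y}} = \mathbf{A}\bar{\mathbf{y}} \quad [\text{by (2.49)}]. \tag{3.59}$$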


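We can use (3.55) and (3.56) to construct the sample covariance matrix for $\mathbf{z}$:

$$\mathbf{S}_z = \begin{pmatrix} s_{z_1}^2 & s_{z_1 z_2} \\ s_{z_2 z_1} & s_{z_2}^2 \end{pmatrix} = \begin{pmatrix} \mathbf{a}_1'\mathbf{S}\mathbf{a}_1 & \mathbf{a}_1'\mathbf{S}\mathbf{a}_2 \\ \mathbf{a}_2'\mathbf{S}\mathbf{a}_1 & \mathbf{a}_2'\mathbf{S}\mathbf{a}_2 \end{pmatrix}. \tag{3.60}$$

By (2.50), this factors into

$$\mathbf{S}_z = \begin{pmatrix} \mathbf{a}_1' \\ \mathbf{a}_2' \end{pmatrix}\mathbf{S}(\mathbf{a}_1, \mathbf{a}_2) = \mathbf{A}\mathbf{S}\mathbf{A}'. \tag{3.61}$$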


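The bivariate results in (3.59) and (3.61) can be readily extended to more than two
linear combinations. (See principal components in Chapter 12, for instance, where
we attempt to transform the $y$'s to a few dimensions that capture most of the information
in the $y$'s.) If we have $k$ linear transformations, they can be expressed as

$$\begin{aligned}
z_1 &= a_{11}y_1 + a_{12}y_2 + \cdots + a_{1p}y_p = \mathbf{a}_1'\mathbf{y} \\
z_2 &= a_{21}y_1 + a_{22}y_2 + \cdots + a_{2p}y_p = \mathbf{a}_2'\mathbf{y} \\
&\;\;\vdots \\
z_k &= a_{k1}y_1 + a_{k2}y_2 + \cdots + a_{kp}y_p = \mathbf{a}_k'\mathbf{y}
\end{aligned}$$

or in matrix notation,

$$\mathbf{z} = \begin{pmatrix} z_1 \\ z_2 \\ \vdots \\ z_k \end{pmatrix} = \begin{pmatrix} \mathbf{a}_1'\mathbf{y} \\ \mathbf{a}_2'\mathbf{y} \\ \vdots \\ \mathbf{a}_k'\mathbf{y} \end{pmatrix} = \begin{pmatrix} \mathbf{a}_1' \\ \mathbf{a}_2' \\ \vdots \\ \mathbf{a}_k' \end{pmatrix}\mathbf{y} = \mathbf{A}\mathbf{y} \quad [\text{by (2.47)}],$$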


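where $\mathbf{z}$ is $k \times 1$, $\mathbf{A}$ is $k \times p$, and $\mathbf{y}$ is $p \times 1$ (we typically have $k \le p$). If $\mathbf{z}_i = \mathbf{A}\mathbf{y}_i$
is evaluated for all $\mathbf{y}_i$, $i = 1, 2, \ldots, n$, then by (3.54) and (2.49), the sample mean
vector of the $z$'s is

$$\bar{\mathbf{z}} = \begin{pmatrix} \mathbf{a}_1'\bar{\mathbf{y}} \\ \mathbf{a}_2'\bar{\mathbf{y}} \\ \vdots \\ \mathbf{a}_k'\bar{\mathbf{y}} \end{pmatrix} = \begin{pmatrix} \mathbf{a}_1' \\ \mathbf{a}_2' \\ \vdots \\ \mathbf{a}_k' \end{pmatrix}\bar{\mathbf{y}} = \mathbf{A}\bar{\mathbf{y}}. \tag{3.62}$$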


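By an extension of (3.60), the sample covariance matrix of the $z$'s becomes

$$\mathbf{S}_z = \begin{pmatrix} \mathbf{a}_1'\mathbf{S}\mathbf{a}_1 & \mathbf{a}_1'\mathbf{S}\mathbf{a}_2 & \cdots & \mathbf{a}_1'\mathbf{S}\mathbf{a}_k \\ \mathbf{a}_2'\mathbf{S}\mathbf{a}_1 & \mathbf{a}_2'\mathbf{S}\mathbf{a}_2 & \cdots & \mathbf{a}_2'\mathbf{S}\mathbf{a}_k \\ \vdots & \vdots & & \vdots \\ \mathbf{a}_k'\mathbf{S}\mathbf{a}_1 & \mathbf{a}_k'\mathbf{S}\mathbf{a}_2 & \cdots & \mathbf{a}_k'\mathbf{S}\mathbf{a}_k \end{pmatrix} \tag{3.63}$$

$$= \begin{pmatrix} \mathbf{a}_1'(\mathbf{S}\mathbf{a}_1, \mathbf{S}\mathbf{a}_2, \ldots, \mathbf{S}\mathbf{a}_k) \\ \mathbf{a}_2'(\mathbf{S}\mathbf{a}_1, \mathbf{S}\mathbf{a}_2, \ldots, \mathbf{S}\mathbf{a}_k) \\ \vdots \\ \mathbf{a}_k'(\mathbf{S}\mathbf{a}_1, \mathbf{S}\mathbf{a}_2, \ldots, \mathbf{S}\mathbf{a}_k) \end{pmatrix}$$

$$= \begin{pmatrix} \mathbf{a}_1' \\ \mathbf{a}_2' \\ \vdots \\ \mathbf{a}_k' \end{pmatrix}(\mathbf{S}\mathbf{a}_1, \mathbf{S}\mathbf{a}_2, \ldots, \mathbf{S}\mathbf{a}_k) \quad [\text{by (2.47)}]$$

$$= \begin{pmatrix} \mathbf{a}_1' \\ \mathbf{a}_2' \\ \vdots \\ \mathbf{a}_k' \end{pmatrix}\mathbf{S}(\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_k) \quad [\text{by (2.48)}]$$

$$= \mathbf{A}\mathbf{S}\mathbf{A}'. \tag{3.64}$$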


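Note that by (3.63) and (3.64), we have

$$\operatorname{tr}(\mathbf{A}\mathbf{S}\mathbf{A}') = \sum_{i=1}^{k} \mathbf{a}_i'\mathbf{S}\mathbf{a}_i. \tag{3.65}$$

A slightly more general linear transformation is

$$\mathbf{z}_i = \mathbf{A}\mathbf{y}_i + \mathbf{b}, \quad i = 1, 2, \ldots, n. \tag{3.66}$$

The sample mean vector and covariance matrix of $\mathbf{z}$ are given by

$$\bar{\mathbf{z}} = \mathbf{A}\bar{\mathbf{y}} + \mathbf{b}, \tag{3.67}$$

$$\mathbf{S}_z = \mathbf{A}\mathbf{S}\mathbf{A}'. \tag{3.68}$$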
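The identities (3.65), (3.67), and (3.68) can be checked numerically. The following is a minimal sketch on synthetic data; the matrix `A`, the vector `b`, and the sizes are hypothetical choices made for illustration, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 25, 4, 2                      # hypothetical sample size and dimensions
Y = rng.normal(size=(n, p))             # synthetic data matrix; row i is y_i'
A = rng.normal(size=(k, p))             # hypothetical k x p coefficient matrix
b = rng.normal(size=k)                  # hypothetical constant vector

Z = Y @ A.T + b                         # row i is (A y_i + b)', as in (3.66)

ybar = Y.mean(axis=0)
S = np.cov(Y, rowvar=False)             # sample covariance of the y's, divisor n - 1

# Direct computation from the transformed observations.
zbar_direct = Z.mean(axis=0)
Sz_direct = np.cov(Z, rowvar=False)

# Computation from ybar and S alone.
zbar_formula = A @ ybar + b             # (3.67)
Sz_formula = A @ S @ A.T                # (3.68); the shift b drops out
trace_sum = sum(a @ S @ a for a in A)   # right-hand side of (3.65), one term per row of A

print(np.allclose(zbar_direct, zbar_formula))
print(np.allclose(Sz_direct, Sz_formula))
print(np.isclose(np.trace(Sz_formula), trace_sum))
```

Adding the constant vector shifts every transformed observation by the same amount, which is why it appears in the mean but not in the covariance matrix.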

Example 3.9.1. Timm (1975, p. 233; 1980, p. 47) reported the results of an experiment
where subjects responded to "probe words" at five positions in a sentence. The
variables are response times for the $j$th probe word, $y_j$, $j = 1, 2, \ldots, 5$. The data
are given in Table 3.5.


Table 3.5. Response Times for Five Probe Word Positions

Subject
Number     y1     y2     y3     y4     y5
   1       51     36     50     35     42
   2       27     20     26     17     27
   3       37     22     41     37     30
   4       42     36     32     34     27
   5       27     18     33     14     29
   6       43     32     43     35     40
   7       41     22     36     25     38
   8       38     21     31     20     16
   9       36     23     27     25     28
  10       26     31     31     32     36
  11       29     20     25     26     25

These variables are commensurate (same measurement units and similar means
and variances), and the researcher may wish to examine some simple linear combinations.
Consider the following linear combination for illustrative purposes:

$$z = 3y_1 - 2y_2 + 4y_3 - y_4 + y_5 = (3, -2, 4, -1, 1)\mathbf{y} = \mathbf{a}'\mathbf{y}.$$

If $z$ is calculated for each of the 11 observations, we obtain $z_1 = 288$, $z_2 =
155$, ..., $z_{11} = 146$ with mean $\bar{z} = 197.0$ and variance $s_z^2 = 2084.0$. These
same results can be obtained using (3.54) and (3.55). The sample mean vector and
covariance matrix for the data are

$$\bar{\mathbf{y}} = \begin{pmatrix} 36.09 \\ 25.55 \\ 34.09 \\ 27.27 \\ 30.73 \end{pmatrix}, \qquad \mathbf{S} = \begin{pmatrix} 65.09 & 33.65 & 47.59 & 36.77 & 25.43 \\ 33.65 & 46.07 & 28.95 & 40.34 & 28.36 \\ 47.59 & 28.95 & 60.69 & 37.37 & 41.13 \\ 36.77 & 40.34 & 37.37 & 62.82 & 31.68 \\ 25.43 & 28.36 & 41.13 & 31.68 & 58.22 \end{pmatrix}.$$

Then, by (3.54),

$$\bar{z} = \mathbf{a}'\bar{\mathbf{y}} = (3, -2, 4, -1, 1)\begin{pmatrix} 36.09 \\ 25.55 \\ 34.09 \\ 27.27 \\ 30.73 \end{pmatrix} = 197.0,$$

and by (3.55), $s_z^2 = \mathbf{a}'\mathbf{S}\mathbf{a} = 2084.0$.

We now define a second linear combination:

$$w = y_1 + 3y_2 - y_3 + y_4 - 2y_5 = (1, 3, -1, 1, -2)\mathbf{y} = \mathbf{b}'\mathbf{y}.$$

The sample mean and variance of $w$ are $\bar{w} = \mathbf{b}'\bar{\mathbf{y}} = 44.45$ and $s_w^2 = \mathbf{b}'\mathbf{S}\mathbf{b} = 605.67$.
The sample covariance of $z$ and $w$ is, by (3.56), $s_{zw} = \mathbf{a}'\mathbf{S}\mathbf{b} = 40.2$.

Using (3.57), we find the sample correlation between $z$ and $w$ to be

$$r_{zw} = \frac{s_{zw}}{\sqrt{s_z^2 s_w^2}} = \frac{40.2}{\sqrt{(2084.0)(605.67)}} = .0358.$$


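We now define the three linear functions

$$\begin{aligned}
z_1 &= y_1 + y_2 + y_3 + y_4 + y_5 \\
z_2 &= 2y_1 - 3y_2 + y_3 - 2y_4 - y_5 \\
z_3 &= -y_1 - 2y_2 + y_3 - 2y_4 + 3y_5,
\end{aligned}$$

which can be written in matrix form as

$$\begin{pmatrix} z_1 \\ z_2 \\ z_3 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 1 & 1 & 1 \\ 2 & -3 & 1 & -2 & -1 \\ -1 & -2 & 1 & -2 & 3 \end{pmatrix}\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \end{pmatrix},$$

or

$$\mathbf{z} = \mathbf{A}\mathbf{y}.$$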
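The calculations in Example 3.9.1 can be reproduced from the Table 3.5 data. The block below is a sketch of one way to do so with NumPy; the variable names are introduced here for illustration, and the data and coefficient vectors are those given in the example.

```python
import numpy as np

# Table 3.5: response times for five probe word positions (11 subjects).
Y = np.array([
    [51, 36, 50, 35, 42],
    [27, 20, 26, 17, 27],
    [37, 22, 41, 37, 30],
    [42, 36, 32, 34, 27],
    [27, 18, 33, 14, 29],
    [43, 32, 43, 35, 40],
    [41, 22, 36, 25, 38],
    [38, 21, 31, 20, 16],
    [36, 23, 27, 25, 28],
    [26, 31, 31, 32, 36],
    [29, 20, 25, 26, 25],
], dtype=float)

a = np.array([3, -2, 4, -1, 1], dtype=float)    # coefficients of z
b = np.array([1, 3, -1, 1, -2], dtype=float)    # coefficients of w
A = np.array([[1, 1, 1, 1, 1],
              [2, -3, 1, -2, -1],
              [-1, -2, 1, -2, 3]], dtype=float)  # the three linear functions

ybar = Y.mean(axis=0)
S = np.cov(Y, rowvar=False)                      # divisor n - 1

z = Y @ a                                        # z_i = a'y_i for each subject
z_mean, z_var = a @ ybar, a @ S @ a              # (3.54) and (3.55)
w_mean, w_var = b @ ybar, b @ S @ b
s_zw = a @ S @ b                                 # (3.56)
r_zw = s_zw / np.sqrt(z_var * w_var)             # (3.57)

zbar_vec = A @ ybar                              # (3.62)
S_z = A @ S @ A.T                                # (3.64)

print(round(z_mean, 1), round(z_var, 1))
print(round(w_mean, 2), round(w_var, 2), round(s_zw, 1), round(r_zw, 4))
```

The same `ybar` and `S` serve every linear combination, so additional functions of the variables cost only a matrix product rather than another pass over the data.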


3.9.2  Population  Properties 

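The sample results in Section 3.9.1 for linear combinations have population counterparts.
Let $z = \mathbf{a}'\mathbf{y}$, where $\mathbf{a}$ is a vector of constants. Then the population mean of
$z$ is

$$E(z) = E(\mathbf{a}'\mathbf{y}) = \mathbf{a}'E(\mathbf{y}) = \mathbf{a}'\boldsymbol{\mu}, \tag{3.69}$$

and the population variance is

$$\sigma_z^2 = \operatorname{var}(\mathbf{a}'\mathbf{y}) = \mathbf{a}'\boldsymbol{\Sigma}\mathbf{a}. \tag{3.70}$$

Let $w = \mathbf{b}'\mathbf{y}$, where $\mathbf{b}$ is a vector of constants different from $\mathbf{a}$. The population
covariance of $z = \mathbf{a}'\mathbf{y}$ and $w = \mathbf{b}'\mathbf{y}$ is

$$\operatorname{cov}(z, w) = \sigma_{zw} = \mathbf{a}'\boldsymbol{\Sigma}\mathbf{b}. \tag{3.71}$$

By (3.12) the population correlation of $z$ and $w$ is

$$\rho_{zw} = \operatorname{corr}(\mathbf{a}'\mathbf{y}, \mathbf{b}'\mathbf{y}) = \frac{\sigma_{zw}}{\sigma_z \sigma_w} = \frac{\mathbf{a}'\boldsymbol{\Sigma}\mathbf{b}}{\sqrt{(\mathbf{a}'\boldsymbol{\Sigma}\mathbf{a})(\mathbf{b}'\boldsymbol{\Sigma}\mathbf{b})}}. \tag{3.72}$$

If $\mathbf{A}\mathbf{y}$ represents several linear combinations, the population mean vector and
covariance matrix are given by

$$E(\mathbf{A}\mathbf{y}) = \mathbf{A}E(\mathbf{y}) = \mathbf{A}\boldsymbol{\mu}, \tag{3.73}$$

$$\operatorname{cov}(\mathbf{A}\mathbf{y}) = \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}'. \tag{3.74}$$

The more general linear transformation $\mathbf{z} = \mathbf{A}\mathbf{y} + \mathbf{b}$ has population mean vector and
covariance matrix

$$E(\mathbf{A}\mathbf{y} + \mathbf{b}) = \mathbf{A}E(\mathbf{y}) + \mathbf{b} = \mathbf{A}\boldsymbol{\mu} + \mathbf{b}, \tag{3.75}$$

$$\operatorname{cov}(\mathbf{A}\mathbf{y} + \mathbf{b}) = \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}'. \tag{3.76}$$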
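The variance and covariance results for a single linear combination follow in one line each from the definition of the covariance matrix. The sketch below assumes only that $\boldsymbol{\Sigma} = E[(\mathbf{y} - \boldsymbol{\mu})(\mathbf{y} - \boldsymbol{\mu})']$ and that constants can be moved outside an expectation; it is offered as orientation rather than as the derivation used in the text.

```latex
\begin{aligned}
\operatorname{var}(\mathbf{a}'\mathbf{y})
  &= E\big[(\mathbf{a}'\mathbf{y} - \mathbf{a}'\boldsymbol{\mu})^2\big]
   = E\big[\mathbf{a}'(\mathbf{y} - \boldsymbol{\mu})(\mathbf{y} - \boldsymbol{\mu})'\mathbf{a}\big]
   = \mathbf{a}'E\big[(\mathbf{y} - \boldsymbol{\mu})(\mathbf{y} - \boldsymbol{\mu})'\big]\mathbf{a}
   = \mathbf{a}'\boldsymbol{\Sigma}\mathbf{a}, \\
\operatorname{cov}(\mathbf{a}'\mathbf{y}, \mathbf{b}'\mathbf{y})
  &= E\big[\mathbf{a}'(\mathbf{y} - \boldsymbol{\mu})(\mathbf{y} - \boldsymbol{\mu})'\mathbf{b}\big]
   = \mathbf{a}'\boldsymbol{\Sigma}\mathbf{b}.
\end{aligned}
```

The second equality in the first line uses the fact that the scalar $\mathbf{a}'(\mathbf{y} - \boldsymbol{\mu})$ equals its own transpose, so its square can be written as a quadratic form.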


3.10  MEASURES  OF  OVERALL  VARIABILITY 

The covariance matrix contains the variances of the $p$ variables and the covariances
between all pairs of variables and is thus a multifaceted picture of the overall variation
in the data. Sometimes it is desirable to have a single numerical value for the
overall multivariate scatter. One such measure is the generalized sample variance,
defined as the determinant of the covariance matrix:

$$\text{Generalized sample variance} = |\mathbf{S}|. \tag{3.77}$$

The generalized sample variance has a geometric interpretation. The extension of
an ellipse to more than two dimensions is called a hyperellipsoid. A $p$-dimensional
hyperellipsoid $(\mathbf{y} - \bar{\mathbf{y}})'\mathbf{S}^{-1}(\mathbf{y} - \bar{\mathbf{y}}) = a^2$, centered at $\bar{\mathbf{y}}$ and based on $\mathbf{S}^{-1}$ to standardize
the distance to the center, contains a proportion of the observations $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n$
in the sample (if $\mathbf{S}$ were replaced by $\boldsymbol{\Sigma}$, the value of $a^2$ could be determined by
tables of the chi-square distribution; see property 3 in Section 4.2). The ellipsoid
$(\mathbf{y} - \bar{\mathbf{y}})'\mathbf{S}^{-1}(\mathbf{y} - \bar{\mathbf{y}}) = a^2$ has axes proportional to the square roots of the eigenvalues
of $\mathbf{S}$. It can be shown that the volume of the ellipsoid is proportional to $|\mathbf{S}|^{1/2}$. If the
smallest eigenvalue $\lambda_p$ is zero, there is no axis in that direction, and the ellipsoid
lies wholly in a $(p - 1)$-dimensional subspace of $p$-space. Consequently, the volume
in $p$-space is zero. This can also be seen by (2.108), $|\mathbf{S}| = \lambda_1 \lambda_2 \cdots \lambda_p$. Hence, if
$\lambda_p = 0$, $|\mathbf{S}| = 0$. A zero eigenvalue indicates a redundancy in the form of a linear
relationship among the variables. (As will be seen in Section 12.7, the eigenvector
corresponding to the zero eigenvalue reveals the form of the linear dependency.) One
solution to the dilemma when $\lambda_p = 0$ is to remove one or more variables.

Another measure of overall variability, the total sample variance, is simply the
trace of $\mathbf{S}$:

$$\text{Total sample variance} = s_{11} + s_{22} + \cdots + s_{pp} = \operatorname{tr}(\mathbf{S}). \tag{3.78}$$

This measure of overall variation ignores covariance structure altogether but is found
useful for comparison purposes in techniques such as principal components (Chapter 12).

In general, for both $|\mathbf{S}|$ and $\operatorname{tr}(\mathbf{S})$, relatively large values reflect a broad scatter
of $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n$ about $\bar{\mathbf{y}}$, whereas lower values indicate closer concentration about
$\bar{\mathbf{y}}$. In the case of $|\mathbf{S}|$, however, as noted previously, an extremely small value of $|\mathbf{S}|$
or $|\mathbf{R}|$ may indicate either small scatter or multicollinearity, a term indicating near
linear relationships in a set of variables. Multicollinearity may be due to high pairwise
correlations or to a high multiple correlation between one variable and several
of the other variables. For other measures of intercorrelation, see Rencher (1998,
Section 1.7).
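The two summary measures, and the effect of a redundant variable on them, can be illustrated numerically. This is a minimal sketch on synthetic data (an assumption); the fourth variable is constructed here as an exact sum of two others purely to force a zero eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(2)
Y = rng.normal(size=(40, 3))                 # synthetic sample, n = 40, p = 3
S = np.cov(Y, rowvar=False)
eigvals = np.linalg.eigvalsh(S)              # eigenvalues of S in ascending order

gen_var = np.linalg.det(S)                   # generalized sample variance, (3.77)
total_var = np.trace(S)                      # total sample variance, (3.78)
print(np.isclose(gen_var, eigvals.prod()))   # |S| equals the product of the eigenvalues
print(np.isclose(total_var, eigvals.sum()))  # tr(S) equals their sum

# Append a redundant variable y4 = y1 + y2 (hypothetical construction).
Y_dep = np.column_stack([Y, Y[:, 0] + Y[:, 1]])
S_dep = np.cov(Y_dep, rowvar=False)
eig_dep, vec_dep = np.linalg.eigh(S_dep)

smallest = eig_dep[0]                        # numerically zero
direction = vec_dep[:, 0]                    # proportional to (1, 1, 0, -1), up to sign
print(smallest, np.linalg.det(S_dep))
print(direction / np.abs(direction).max())
```

The eigenvector paired with the zero eigenvalue has entries proportional to (1, 1, 0, -1), which recovers the relation y1 + y2 - y4 = 0 that was built into the data.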

3.11  ESTIMATION  OF  MISSING  VALUES 

It  is  not  uncommon  to  find  missing  measurements  in  an  observation  vector,  that  is, 
missing  values  for  one  or  more  variables.  A  small  number  of  rows  with  missing 
entries  in  the  data  matrix  Y  [see  (3.17)]  does  not  constitute  a  serious  problem;  we 
can  simply  discard  each  row  that  has  a  missing  value.  However,  with  this  procedure, 
a  small  portion  of  missing  data,  if  widely  distributed,  would  lead  to  a  substantial  loss 
of data. For example, in a large data set with n = 550 and p = 85, only about 1.5%
of  the  550  x  85  =  46,750  measurements  were  missing.  However,  nearly  half  of  the 
observation  vectors  (rows  of  Y )  turned  out  to  be  incomplete. 

The  distribution  of  missing  values  in  a  data  set  is  an  important  consideration. 
Randomly  missing  variable  values  scattered  throughout  a  data  matrix  are  less  serious 
than  a  pattern  of  missing  values  that  depends  to  some  extent  on  the  values  of  the 
missing  variables. 

We  discuss  two  methods  of  estimating  the  missing  values,  or  “filling  the  holes,” 
in  the  data  matrix,  also  called  imputation.  Both  procedures  presume  that  the  missing 
values  occur  at  random.  If  the  occurrence  or  nonoccurrence  of  missing  values  is 
related  to  the  values  of  some  of  the  variables,  then  the  techniques  may  not  estimate 
the  missing  responses  very  well. 

The first method is very simple: substitute a mean for each missing value, specifically
the average of the available data in the column of the data matrix in which
the unknown value lies. Replacing an observation by its mean reduces the variance
and the absolute value of the covariance. Therefore, the sample covariance matrix $\mathbf{S}$
computed from the data matrix $\mathbf{Y}$ in (3.17) with means imputed for missing values
is biased. However, it is positive definite.
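The shrinkage in variance caused by mean imputation is easy to see in a single column. The sketch below uses the second column of the calcium data from Table 3.3 with its first entry treated as missing, a choice made here for illustration; the variable names are likewise hypothetical.

```python
import numpy as np

# Column y2 of the calcium data, with the first entry treated as missing.
y2 = np.array([np.nan, 4.9, 30.0, 2.8, 2.7, 2.8, 4.6, 10.9, 8.0, 1.6])

observed = y2[~np.isnan(y2)]
col_mean = observed.mean()                         # mean of the available data
filled = np.where(np.isnan(y2), col_mean, y2)      # substitute the mean for the hole

var_observed = observed.var(ddof=1)                # variance of the 9 available values
var_filled = filled.var(ddof=1)                    # variance after imputation, 10 values

print(round(col_mean, 3))
print(var_filled < var_observed)
print(np.isclose(var_filled, var_observed * 8 / 9))
```

The imputed value adds nothing to the sum of squared deviations and leaves the mean unchanged, while the divisor grows from 8 to 9, so the variance shrinks by exactly the factor 8/9 in this case.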

The second technique is a regression approach. The data matrix $\mathbf{Y}$ is partitioned
into two parts, one containing all rows with missing entries and the other comprising
all the complete rows. Suppose $y_{ij}$ is the only missing entry in the $i$th row of $\mathbf{Y}$.
Then using the data in the submatrix with complete rows, $y_j$ is regressed on the
other variables to obtain a prediction equation $\hat{y}_j = b_0 + b_1 y_1 + \cdots + b_{j-1}y_{j-1} +
b_{j+1}y_{j+1} + \cdots + b_p y_p$. Then the nonmissing entries in the $i$th row are entered as
independent variables in the regression equation to obtain the predicted value, $\hat{y}_{ij}$.
The regression method was first proposed by Buck (1960) and is a special case of
the EM algorithm (Dempster, Laird, and Rubin 1977).

The regression method can be improved by iteration, carried out, for example, in
the following way. Estimate all missing entries in the data matrix using regression.
After filling in the missing entries, use the full data matrix to obtain new prediction
equations. Use these prediction equations to calculate new predicted values $\hat{y}_{ij}$ for
missing entries. Use the new data matrix to obtain revised prediction equations and
new predicted values $\hat{y}_{ij}$. Continue this process until the predicted values stabilize.

A  modification  may  be  needed  if  the  missing  entries  are  so  pervasive  that  it  is 
difficult  to  find  data  to  estimate  the  initial  regression  equations.  In  this  case,  the 
process  could  be  started  by  using  means  as  in  the  first  method  and  then  beginning 
the  iteration. 

The  regression  approach  will  ordinarily  yield  better  results  than  the  method  of 
inserting  means.  However,  if  the  other  variables  are  not  very  highly  correlated  with 
the  one  to  be  predicted,  the  regression  technique  is  essentially  equivalent  to  imputing 
means.  The  regression  method  underestimates  the  variances  and  covariances,  though 
to  a  lesser  extent  than  the  method  based  on  means. 

Example 3.11. We illustrate the iterated regression method of estimating missing
values. Consider the calcium data of Table 3.3 as reproduced here and suppose the
entries in parentheses are missing:

Location
Number     y1      y2       y3
   1       35     (3.5)    2.80
   2       35      4.9    (2.70)
   3       40     30.0     4.38
   4       10      2.8     3.21
   5        6      2.7     2.73
   6       20      2.8     2.81
   7       35      4.6     2.88
   8       35     10.9     2.90
   9       35      8.0     3.28
  10       30      1.6     3.20

We first regress $y_2$ on $y_1$ and $y_3$ for observations 3-10 and obtain $\hat{y}_2 = b_0 +
b_1 y_1 + b_3 y_3$. When this is evaluated for the two nonmissing entries in the first row
($y_1 = 35$ and $y_3 = 2.80$), we obtain $\hat{y}_2 = 4.097$. Similarly, we regress $y_3$ on $y_1$ and
$y_2$ for observations 3-10 to obtain $\hat{y}_3 = c_0 + c_1 y_1 + c_2 y_2$. Evaluating this for the
two nonmissing entries in the second row yields $\hat{y}_3 = 3.011$. We now insert these
estimates for the missing values and calculate the regression equations based on all
10 observations. Using the revised equation $\hat{y}_2 = b_0 + b_1 y_1 + b_3 y_3$, we obtain a new
predicted value, $\hat{y}_2 = 3.698$. Similarly, we obtain a revised regression equation for $y_3$
that gives a new predicted value, $\hat{y}_3 = 2.981$. With these values inserted, we calculate
new equations and obtain new predicted values, $\hat{y}_2 = 3.672$ and $\hat{y}_3 = 2.976$. At the
third iteration we obtain $\hat{y}_2 = 3.679$ and $\hat{y}_3 = 2.975$. There is very little change
in subsequent iterations. These values are closer to the actual values, $y_2 = 3.5$ and
$y_3 = 2.70$, than the initial regression estimates, $\hat{y}_2 = 4.097$ and $\hat{y}_3 = 3.011$. They
are also much better estimates than the means of the second and third columns, $\bar{y}_2 =
7.589$ and $\bar{y}_3 = 3.132$. □
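The iterated regression procedure of Example 3.11 can be sketched in a few lines. The implementation below is one possible reading of the steps, not necessarily the computation used to produce the figures above: it fits each regression by ordinary least squares and updates both missing entries together on every pass, which is an assumption about the update order. The helper `predict` and the stopping tolerance are hypothetical.

```python
import numpy as np

# Calcium data of Table 3.3; positions listed in `missing` are treated as unknown.
Y_true = np.array([
    [35, 3.5, 2.80], [35, 4.9, 2.70], [40, 30.0, 4.38], [10, 2.8, 3.21],
    [6, 2.7, 2.73], [20, 2.8, 2.81], [35, 4.6, 2.88], [35, 10.9, 2.90],
    [35, 8.0, 3.28], [30, 1.6, 3.20],
])
missing = [(0, 1), (1, 2)]                   # (row, column): y2 in row 1, y3 in row 2

def predict(fit_rows, target_row, j):
    """Regress column j on the other columns of fit_rows, then predict it for target_row."""
    others = [c for c in range(fit_rows.shape[1]) if c != j]
    X = np.column_stack([np.ones(len(fit_rows)), fit_rows[:, others]])
    coef, *_ = np.linalg.lstsq(X, fit_rows[:, j], rcond=None)
    return coef[0] + target_row[others] @ coef[1:]

Y = Y_true.copy()
for i, j in missing:
    Y[i, j] = np.nan                         # blank out the entries to be estimated

# First pass: fit on the complete rows only (observations 3-10).
complete = Y[~np.isnan(Y).any(axis=1)]
first_pass = {(i, j): predict(complete, Y[i], j) for i, j in missing}
for (i, j), value in first_pass.items():
    Y[i, j] = value

# Iterate: refit on all 10 rows using the current estimates until they stabilize.
for _ in range(200):
    updates = {(i, j): predict(Y, Y[i], j) for i, j in missing}
    change = max(abs(updates[k] - Y[k]) for k in updates)
    for (i, j), value in updates.items():
        Y[i, j] = value
    if change < 1e-8:
        break

final = {k: Y[k] for k in missing}
print({k: round(v, 3) for k, v in first_pass.items()})
print({k: round(v, 3) for k, v in final.items()})
```

Starting from the complete rows and then refitting on the filled matrix mirrors the two stages described in the example; when complete rows are scarce, the first pass could instead be seeded with column means.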


3.12  DISTANCE  BETWEEN  VECTORS 

In a univariate setting, the distance between two points is simply the difference (or
absolute difference) between their values. For statistical purposes, this difference
may not be very informative. For example, we do not want to know how many centimeters
apart two means are, but rather how many standard deviations apart they
are. Thus we examine the standardized or statistical distances, such as

$$\frac{|\mu_1 - \mu_2|}{\sigma} \quad \text{or} \quad \frac{|\bar{y} - \mu|}{\sigma_{\bar{y}}}.$$

To obtain a useful distance measure in a multivariate setting, we must consider
not only the variances of the variables but also their covariances or correlations. The
simple (squared) Euclidean distance between two vectors, $(\mathbf{y}_1 - \mathbf{y}_2)'(\mathbf{y}_1 - \mathbf{y}_2)$, is
not useful in some situations because there is no adjustment for the variances or the
covariances. For a statistical distance, we standardize by inserting the inverse of the
covariance matrix:

$$d^2 = (\mathbf{y}_1 - \mathbf{y}_2)'\mathbf{S}^{-1}(\mathbf{y}_1 - \mathbf{y}_2). \tag{3.79}$$

Other examples are

$$D^2 = (\bar{\mathbf{y}} - \boldsymbol{\mu})'\mathbf{S}^{-1}(\bar{\mathbf{y}} - \boldsymbol{\mu}) \tag{3.80}$$

$$\Delta^2 = (\bar{\mathbf{y}} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\bar{\mathbf{y}} - \boldsymbol{\mu}) \tag{3.81}$$

$$\Delta^2 = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2). \tag{3.82}$$

These (squared) distances between two vectors were first proposed by Mahalanobis
(1936) and are often referred to as Mahalanobis distances. If a random variable
has a larger variance than another, it receives relatively less weight in a Mahalanobis
distance. Similarly, two highly correlated variables do not contribute as much
as two variables that are less correlated. In essence, then, the use of the inverse of
the covariance matrix in a Mahalanobis distance has the effect of (1) standardizing
all variables to the same variance and (2) eliminating correlations. To illustrate this,
we use the square root matrix defined in (2.112) to rewrite (3.81) as

$$\begin{aligned}
\Delta^2 &= (\bar{\mathbf{y}} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\bar{\mathbf{y}} - \boldsymbol{\mu}) = (\bar{\mathbf{y}} - \boldsymbol{\mu})'(\boldsymbol{\Sigma}^{1/2}\boldsymbol{\Sigma}^{1/2})^{-1}(\bar{\mathbf{y}} - \boldsymbol{\mu}) \\
&= [(\boldsymbol{\Sigma}^{1/2})^{-1}(\bar{\mathbf{y}} - \boldsymbol{\mu})]'[(\boldsymbol{\Sigma}^{1/2})^{-1}(\bar{\mathbf{y}} - \boldsymbol{\mu})] = \mathbf{z}'\mathbf{z},
\end{aligned}$$

where $\mathbf{z} = (\boldsymbol{\Sigma}^{1/2})^{-1}(\bar{\mathbf{y}} - \boldsymbol{\mu}) = (\boldsymbol{\Sigma}^{1/2})^{-1}\bar{\mathbf{y}} - (\boldsymbol{\Sigma}^{1/2})^{-1}\boldsymbol{\mu}$. Now, by (3.76) it can be
shown that

$$\operatorname{cov}(\mathbf{z}) = \frac{1}{n}\mathbf{I}. \tag{3.83}$$

Hence the transformed variables $z_1, z_2, \ldots, z_p$ are uncorrelated, and each has variance
equal to $1/n$. If the appropriate covariance matrix for the random vector were
used in a Mahalanobis distance, the variances would reduce to 1. For example, if
$\operatorname{cov}(\bar{\mathbf{y}}) = \boldsymbol{\Sigma}/n$ were used above in place of $\boldsymbol{\Sigma}$, we would obtain $\operatorname{cov}(\mathbf{z}) = \mathbf{I}$.
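The standardizing and decorrelating effect of the inverse covariance matrix can be seen numerically. The sketch below works with a sample covariance matrix $\mathbf{S}$ in place of $\boldsymbol{\Sigma}$ and with synthetic correlated data; both choices, and the mixing matrix `mix`, are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
mix = np.array([[2.0, 0.0, 0.0],
                [1.5, 1.0, 0.0],
                [0.5, 0.3, 0.2]])            # hypothetical mixing matrix; induces correlation
Y = rng.normal(size=(200, 3)) @ mix.T        # synthetic correlated sample
S = np.cov(Y, rowvar=False)

diff = Y[0] - Y[1]                           # difference of two observation vectors
d2 = diff @ np.linalg.solve(S, diff)         # (3.79), without forming the inverse of S
euclid2 = diff @ diff                        # squared Euclidean distance, for contrast

# Symmetric inverse square root of S from its spectral decomposition.
eigvals, eigvecs = np.linalg.eigh(S)
S_half_inv = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T

z = S_half_inv @ diff
d2_from_z = z @ z                            # the same distance written as z'z

# Transforming every centered observation gives unit variances and zero covariances.
Z = (Y - Y.mean(axis=0)) @ S_half_inv
S_Z = np.cov(Z, rowvar=False)

print(np.isclose(d2, d2_from_z))
print(np.allclose(S_Z, np.eye(3)))
print(round(d2, 3), round(euclid2, 3))
```

Solving the linear system rather than inverting the covariance matrix is a numerical convenience; the square root matrix is formed here only to make the transformed variables explicit.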


PROBLEMS 

3.1  If $z_i = a y_i$ for $i = 1, 2, \ldots, n$, show that $\bar{z} = a\bar{y}$ as in (3.3).

3.2  If $z_i = a y_i$ for $i = 1, 2, \ldots, n$, show that $s_z^2 = a^2 s^2$ as in (3.6).

3.3  For the data in Figure 3.3, show that $\sum_i (x_i - \bar{x})(y_i - \bar{y}) = 0$.

3.4  Show that $(\mathbf{x} - \bar{x}\mathbf{j})'(\mathbf{y} - \bar{y}\mathbf{j}) = \sum_i (x_i - \bar{x})(y_i - \bar{y})$, thus verifying (3.15).

3.5  For $p = 3$ show that

$$\frac{1}{n - 1}\sum_{i=1}^{n}(\mathbf{y}_i - \bar{\mathbf{y}})(\mathbf{y}_i - \bar{\mathbf{y}})' = \begin{pmatrix} s_{11} & s_{12} & s_{13} \\ s_{21} & s_{22} & s_{23} \\ s_{31} & s_{32} & s_{33} \end{pmatrix},$$

which illustrates (3.27).

3.6  Show that $\bar{z} = \mathbf{a}'\bar{\mathbf{y}}$ as in (3.54), where $z_i = \mathbf{a}'\mathbf{y}_i$, $i = 1, 2, \ldots, n$.

3.7  Show that $s_z^2 = \mathbf{a}'\mathbf{S}\mathbf{a}$ as in (3.55), where $z_i = \mathbf{a}'\mathbf{y}_i$, $i = 1, 2, \ldots, n$.

3.8  Show that $\operatorname{tr}(\mathbf{A}\mathbf{S}\mathbf{A}') = \sum_{i=1}^{k} \mathbf{a}_i'\mathbf{S}\mathbf{a}_i$ as in (3.65).

3.9  Use (3.76) to verify (3.83), $\operatorname{cov}(\mathbf{z}) = \mathbf{I}/n$, where $\mathbf{z} = (\boldsymbol{\Sigma}^{1/2})^{-1}(\bar{\mathbf{y}} - \boldsymbol{\mu})$.

3.10  Use the calcium data in Table 3.3:

(a)  Calculate $\mathbf{S}$ using the data matrix $\mathbf{Y}$ as in (3.29).

(b)  Obtain $\mathbf{R}$ by calculating $r_{12}$, $r_{13}$, and $r_{23}$, as in (3.34) and (3.35).

(c)  Find $\mathbf{R}$ using (3.37).

3.11  Use the calcium data in Table 3.3:

(a)  Find the generalized sample variance $|\mathbf{S}|$ as in (3.77).

(b)  Find the total sample variance $\operatorname{tr}(\mathbf{S})$, as in (3.78).

3.12  Use the probe word data of Table 3.5:

(a)  Find the generalized sample variance $|\mathbf{S}|$ as in (3.77).

(b)  Find the total sample variance $\operatorname{tr}(\mathbf{S})$ as in (3.78).

3.13  For the probe word data in Table 3.5, find $\mathbf{R}$ using (3.37).

3.14  For the variables in Table 3.3, define $z = 3y_1 - y_2 + 2y_3 = (3, -1, 2)\mathbf{y}$. Find
$\bar{z}$ and $s_z^2$ in two ways:

(a)  Evaluate $z$ for each row of Table 3.3 and find $\bar{z}$ and $s_z^2$ directly from $z_1$,
$z_2, \ldots, z_{10}$ using (3.1) and (3.5).

(b)  Use $\bar{z} = \mathbf{a}'\bar{\mathbf{y}}$ and $s_z^2 = \mathbf{a}'\mathbf{S}\mathbf{a}$, as in (3.54) and (3.55).

3.15  For the variables in Table 3.3, define $w = -2y_1 + 3y_2 + y_3$ and define $z$ as in
Problem 3.14. Find $r_{zw}$ in two ways:

(a)  Evaluate $z$ and $w$ for each row of Table 3.3 and find $r_{zw}$ from the 10 pairs
$(z_i, w_i)$, $i = 1, 2, \ldots, 10$, using (3.10) and (3.13).

(b)  Find $r_{zw}$ using (3.57).

3.16  For the variables in Table 3.3, find the correlation between $y_1$ and $\frac{1}{2}(y_2 + y_3)$
using (3.57).


Table 3.6. Ramus Bone Length at Four Ages for 20 Boys

                          Age
               8 yr    8½ yr    9 yr    9½ yr
Individual     (y1)    (y2)     (y3)    (y4)
    1          47.8    48.8     49.0    49.7
    2          46.4    47.3     47.7    48.4
    3          46.3    46.8     47.8    48.5
    4          45.1    45.3     46.1    47.2
    5          47.6    48.5     48.9    49.3
    6          52.5    53.2     53.3    53.7
    7          51.2    53.0     54.3    54.5
    8          49.8    50.0     50.3    52.7
    9          48.1    50.8     52.3    54.4
   10          45.0    47.0     47.3    48.3
   11          51.2    51.4     51.6    51.9
   12          48.5    49.2     53.0    55.5
   13          52.1    52.8     53.7    55.0
   14          48.2    48.9     49.3    49.8
   15          49.6    50.4     51.2    51.8
   16          50.7    51.7     52.7    53.3
   17          47.2    47.7     48.4    49.5
   18          53.3    54.6     55.1    55.3
   19          46.2    47.5     48.1    48.4
   20          46.3    47.6     51.3    51.8

3.17  Define the following linear combinations for the variables in Table 3.3:

$$\begin{aligned}
z_1 &= y_1 + y_2 + y_3, \\
z_2 &= 2y_1 - 3y_2 + 2y_3, \\
z_3 &= -y_1 - 2y_2 - 3y_3.
\end{aligned}$$

(a)  Find $\bar{\mathbf{z}}$ and $\mathbf{S}_z$ using (3.62) and (3.64).

(b)  Find $\mathbf{R}_z$ from $\mathbf{S}_z$ using (3.37).

3.18  The data in Table 3.6 (Elston and Grizzle 1962) consist of measurements $y_1$,
$y_2$, $y_3$, and $y_4$ of the ramus bone at four different ages on each of 20 boys.

(a)  Find $\bar{\mathbf{y}}$, $\mathbf{S}$, and $\mathbf{R}$.

(b)  Find $|\mathbf{S}|$ and $\operatorname{tr}(\mathbf{S})$.


Table 3.7. Measurements on the First and Second Adult
Sons in a Sample of 25 Families

      First Son               Second Son
  Head       Head         Head       Head
 Length    Breadth       Length    Breadth
   y1         y2           x1         x2
  191        155          179        145
  195        149          201        152
  181        148          185        149
  183        153          188        149
  176        144          171        142
  208        157          192        152
  189        150          190        149
  197        159          189        152
  188        152          197        159
  192        150          187        151
  179        158          186        148
  183        147          174        147
  174        150          185        152
  190        159          195        157
  188        151          187        158
  163        137          161        130
  195        155          183        158
  186        153          173        148
  181        145          182        146
  175        140          165        137
  192        154          185        152
  174        143          178        147
  176        139          176        143
  197        167          200        158
  190        163          187        150

3.19  For the data in Table 3.6, define $z = y_1 + 2y_2 + y_3 - 3y_4$ and $w = -2y_1 +
3y_2 - y_3 + 2y_4$.

(a)  Find $\bar{z}$, $\bar{w}$, $s_z^2$, and $s_w^2$ using (3.54) and (3.55).

(b)  Find $s_{zw}$ and $r_{zw}$ using (3.56) and (3.57).

3.20  For the data in Table 3.6 define

$$\begin{aligned}
z_1 &= 2y_1 + 3y_2 - y_3 + 4y_4, \\
z_2 &= -2y_1 - y_2 + 4y_3 - 2y_4, \\
z_3 &= 3y_1 - 2y_2 - y_3 + 3y_4.
\end{aligned}$$

Find $\bar{\mathbf{z}}$, $\mathbf{S}_z$, and $\mathbf{R}_z$ using (3.62), (3.64), and (3.37), respectively.

3.21  The data in Table 3.7 consist of head measurements on first and second sons
(Frets 1921). Define $y_1$ and $y_2$ as the measurements on the first son and $x_1$ and
$x_2$ for the second son.

(a)  Find the mean vector for all four variables and partition it into $\begin{pmatrix} \bar{\mathbf{y}} \\ \bar{\mathbf{x}} \end{pmatrix}$ as in
(3.41).

(b)  Find the covariance matrix for all four variables and partition it into

$$\mathbf{S} = \begin{pmatrix} \mathbf{S}_{yy} & \mathbf{S}_{yx} \\ \mathbf{S}_{xy} & \mathbf{S}_{xx} \end{pmatrix},$$

as in (3.42).

3.22  Table 3.8 contains data from O'Sullivan and Mahan (1966; see also Andrews
and Herzberg 1985, p. 214) with measurements of blood glucose levels on three
occasions for 50 women. The $y$'s represent fasting glucose measurements on
the three occasions; the $x$'s are glucose measurements 1 hour after sugar intake.
Find the mean vector and covariance matrix for all six variables and partition
them into $\begin{pmatrix} \bar{\mathbf{y}} \\ \bar{\mathbf{x}} \end{pmatrix}$, as in (3.41), and

$$\mathbf{S} = \begin{pmatrix} \mathbf{S}_{yy} & \mathbf{S}_{yx} \\ \mathbf{S}_{xy} & \mathbf{S}_{xx} \end{pmatrix},$$

as in (3.42).


Table 3.8. Blood Glucose Measurements on Three Occasions

       Fasting          One Hour after Sugar Intake
  y1    y2    y3          x1    x2    x3
  60    69    62          97    69    98
  56    53    84         103    78   107
  80    69    76          66    99   130
  55    80    90          80    85   114
  62    75    68         116   130    91
  74    64    70         109   101   103
  64    71    66          77   102   130
  73    70    64         115   110   109
  68    67    75          76    85   119
  69    82    74          72   133   127
  60    67    61         130   134   121
  70    74    78         150   158   100
  66    74    78         150   131   142
  83    70    74          99    98   105
  68    66    90         119    85   109
  78    63    75         164    98   138
 103    77    77         160   117   121
  77    68    74         144    71   153
  66    77    68          77    82    89
  70    70    72         114    93   122
  75    65    71          77    70   109
  91    74    93         118   115   150
  66    75    73         170   147   121
  75    82    76         153   132   115
  74    71    66         143   105   100
  76    70    64         114   113   129
  74    90    86          73   106   116
  74    77    80         116    81    77
  67    71    69          63    87    70
  78    75    80         105   132    80
  64    66    71          83    94   133
  71    80    76          81    87    86
  63    75    73         120    89    59
  90   103    74         107   109   101
  60    76    61          99   111    98
  48    77    75         113   124    97
  66    93    97         136   112   122
  74    70    76         109    88   105
  60    74    71          72    90    71
  63    75    66         130   101    90
  66    80    86         130   117   144
  77    67    74          83    92   107
  70    67   100         150   142   146
  73    76    81         119   120   119
  78    90    77         122   155   149
  73    68    80         102    90   122
  72    83    68         104    69    96
  65    60    70         119    94    89
  52    70    76          92    94   100

Note: Measurements are in mg/100 ml.

CHAPTER 4


The  Multivariate  Normal  Distribution 


4.1  MULTIVARIATE  NORMAL  DENSITY  FUNCTION 

Many  univariate  tests  and  confidence  intervals  are  based  on  the  univariate  normal 
distribution.  Similarly,  the  majority  of  multivariate  procedures  have  the  multivariate 
normal  distribution  as  their  underpinning. 

The following are some of the useful features of the multivariate normal distribution
(see Section 4.2): (1) the distribution can be completely described using only
means, variances, and covariances; (2) bivariate plots of multivariate data show linear
trends; (3) if the variables are uncorrelated, they are independent; (4) linear functions
of multivariate normal variables are also normal; (5) as in the univariate case,
the convenient form of the density function lends itself to derivation of many properties
and test statistics; and (6) even when the data are not multivariate normal, the
multivariate normal may serve as a useful approximation, especially in inferences
involving sample mean vectors, which are approximately multivariate normal by the
central limit theorem (see Section 4.3.2).

Since the multivariate normal density is an extension of the univariate normal density
and shares many of its features, we review the univariate normal density function
in Section 4.1.1. We then describe the multivariate normal density in Sections 4.1.2
through 4.1.4.


4.1.1  Univariate  Normal  Density 

If a random variable $y$, with mean $\mu$ and variance $\sigma^2$, is normally distributed, its
density is given by

$$f(y) = \frac{1}{\sqrt{2\pi}\,\sqrt{\sigma^2}}\, e^{-(y-\mu)^2/2\sigma^2}, \qquad -\infty < y < \infty. \tag{4.1}$$

When $y$ has the density (4.1), we say that $y$ is distributed as $N(\mu, \sigma^2)$, or simply $y$ is
$N(\mu, \sigma^2)$. This function is represented by the familiar bell-shaped curve illustrated
in Figure 4.1 for $\mu = 10$ and $\sigma = 2.5$.

Figure 4.1. The normal density curve.


4.1.2  Multivariate  Normal  Density 

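If $\mathbf{y}$ has a multivariate normal distribution with mean vector $\boldsymbol{\mu}$ and covariance matrix
$\boldsymbol{\Sigma}$, the density is given by

$$g(\mathbf{y}) = \frac{1}{(\sqrt{2\pi})^{p}\,|\boldsymbol{\Sigma}|^{1/2}}\, e^{-(\mathbf{y}-\boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{y}-\boldsymbol{\mu})/2}, \tag{4.2}$$

where $p$ is the number of variables. When $\mathbf{y}$ has the density (4.2), we say that $\mathbf{y}$ is
distributed as $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, or simply $\mathbf{y}$ is $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$.

The term $(y-\mu)^2/\sigma^2 = (y-\mu)(\sigma^2)^{-1}(y-\mu)$ in the exponent of the univariate
normal density (4.1) measures the squared distance from $y$ to $\mu$ in standard deviation
units. Similarly, the term $(\mathbf{y}-\boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{y}-\boldsymbol{\mu})$ in the exponent of the multivariate
normal density (4.2) is the squared generalized distance from $\mathbf{y}$ to $\boldsymbol{\mu}$, or the Mahalanobis
distance,

$$\Delta^2 = (\mathbf{y}-\boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{y}-\boldsymbol{\mu}). \tag{4.3}$$

The characteristics of this distance between $\mathbf{y}$ and $\boldsymbol{\mu}$ were discussed in Section 3.12.
Note that $\Delta$, the square root of (4.3), is not in standard deviation units as is $(y-\mu)/\sigma$.
The distance $\Delta$ increases with $p$, the number of variables (see Problem 4.4).

In the coefficient of the exponential function in (4.2), $|\boldsymbol{\Sigma}|^{1/2}$ appears as the analogue
of $\sqrt{\sigma^2}$ in (4.1). In the next section, we discuss the effect of $|\boldsymbol{\Sigma}|$ on the density.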
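
As a rough numerical illustration of (4.2) and (4.3), the following sketch evaluates the squared generalized distance and the density for the bivariate case ($p = 2$) using only the Python standard library. The function names and the example values of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are hypothetical and chosen only for illustration.

```python
import math

def mahalanobis_sq(y, mu, sigma):
    """Squared generalized distance (4.3) for p = 2, using the explicit 2x2 inverse."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    d1, d2 = y[0] - mu[0], y[1] - mu[1]
    # Sigma^{-1} = (1/det) * [[d, -b], [-c, a]]
    return (d1 * (d * d1 - b * d2) + d2 * (-c * d1 + a * d2)) / det

def mvn_density(y, mu, sigma):
    """Bivariate normal density (4.2) with p = 2."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    p = 2
    coeff = 1.0 / (math.sqrt(2 * math.pi) ** p * math.sqrt(det))
    return coeff * math.exp(-mahalanobis_sq(y, mu, sigma) / 2)

# Hypothetical example: mu = (0, 0)', Sigma = [[4, 2], [2, 3]], y = (2, 1)'
delta_sq = mahalanobis_sq((2.0, 1.0), (0.0, 0.0), [[4.0, 2.0], [2.0, 3.0]])
height = mvn_density((2.0, 1.0), (0.0, 0.0), [[4.0, 2.0], [2.0, 3.0]])
```

For larger $p$ one would normally rely on a linear algebra library rather than a hand-coded inverse; the explicit form is used here only to keep the example self-contained.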

4.1.3  Generalized  Population  Variance 

In Section 3.10, we referred to $|\mathbf{S}|$ as a generalized sample variance. Analogously,
$|\boldsymbol{\Sigma}|$ is a generalized population variance. If $\sigma^2$ is small in the univariate normal,
the $y$ values are concentrated near the mean. Similarly, a small value of $|\boldsymbol{\Sigma}|$ in the
multivariate case indicates that the $y$'s are concentrated close to $\boldsymbol{\mu}$ in $p$-space or that
there is multicollinearity among the variables. The term multicollinearity indicates
that the variables are highly intercorrelated, in which case the effective dimensionality
is less than $p$. (See Chapter 12 for a method of finding a reduced number of new
dimensions that represent the data.) In the presence of multicollinearity, one or more
eigenvalues of $\boldsymbol{\Sigma}$ will be near zero and $|\boldsymbol{\Sigma}|$ will be small, since $|\boldsymbol{\Sigma}|$ is the product of
the eigenvalues [see (2.108)].

Figure 4.2. Bivariate normal densities: (a) small $|\boldsymbol{\Sigma}|$, (b) large $|\boldsymbol{\Sigma}|$.

Figure 4.2 shows, for the bivariate case, a comparison of a distribution with small
$|\boldsymbol{\Sigma}|$ and a distribution with larger $|\boldsymbol{\Sigma}|$. An alternative way to portray the concentration
of points in the bivariate normal distribution is with contour plots. Figure 4.3 shows
contour plots for the two distributions in Figure 4.2. Each ellipse contains a different
proportion of observation vectors $\mathbf{y}$. The contours in Figure 4.3 can be found by
setting the density function equal to a constant and solving for $\mathbf{y}$, as illustrated in
Figure 4.4. The bivariate normal density surface sliced at a constant height traces
an ellipse, which contains a given proportion of the observations (Rencher 1998,
Section 2.1.3).

In both Figures 4.2 and 4.3, small $|\boldsymbol{\Sigma}|$ appears on the left and large $|\boldsymbol{\Sigma}|$ appears
on the right. In Figure 4.3a, there is a larger correlation between $y_1$ and $y_2$. In Figure 4.3b,
the variances are larger (in the natural directions). In general, for any number
of variables $p$, a decrease in intercorrelations among the variables or an increase
in the variances will lead to a larger $|\boldsymbol{\Sigma}|$.

Figure 4.3. Contour plots for the distributions in Figure 4.2: (a) small $|\boldsymbol{\Sigma}|$, (b) large $|\boldsymbol{\Sigma}|$.
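
The dependence of $|\boldsymbol{\Sigma}|$ on the correlation and on the variances can be checked numerically for $p = 2$. The sketch below is only an illustration under assumed inputs (standard deviations `s1`, `s2` and correlation `rho` are hypothetical names); it computes $|\boldsymbol{\Sigma}|$ directly and also as the product of the two eigenvalues.

```python
import math

def generalized_variance(s1, s2, rho):
    """|Sigma| for a bivariate covariance matrix with std devs s1, s2 and correlation rho."""
    cov = rho * s1 * s2
    return s1**2 * s2**2 - cov**2

def eigenvalues_2x2(s1, s2, rho):
    """Eigenvalues of the same 2x2 covariance matrix, from its characteristic equation."""
    a, d, b = s1**2, s2**2, rho * s1 * s2
    half_trace = (a + d) / 2
    root = math.sqrt(((a - d) / 2) ** 2 + b**2)
    return half_trace + root, half_trace - root

# As rho approaches 1, the smaller eigenvalue and |Sigma| both approach 0.
for rho in (0.0, 0.5, 0.9, 0.99):
    lam1, lam2 = eigenvalues_2x2(1.0, 1.0, rho)
    print(rho, generalized_variance(1.0, 1.0, rho), lam1 * lam2)
```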


4.1.4  Diversity  of  Applications  of  the  Multivariate  Normal 

Nearly all the inferential procedures we discuss in this book are based on the multivariate
normal distribution. We acknowledge that a major motivation for the
widespread  use  of  the  multivariate  normal  is  its  mathematical  tractability.  From 
the  multivariate  normal  assumption,  a  host  of  useful  procedures  can  be  derived, 
and  many  of  these  are  available  in  software  packages.  Practical  alternatives  to  the 
multivariate  normal  are  fewer  than  in  the  univariate  case.  Because  it  is  not  as  simple 
to  order  (or  rank)  multivariate  observation  vectors  as  it  is  for  univariate  observations, 
there  are  not  as  many  nonparametric  procedures  available  for  multivariate  data. 

Although  real  data  may  not  often  be  exactly  multivariate  normal,  the  multivariate 
normal  will  frequently  serve  as  a  useful  approximation  to  the  true  distribution.  Tests 
and  graphical  procedures  are  available  for  assessing  normality  (see  Sections  4.4  and 
4.5).  Fortunately,  many  of  the  procedures  based  on  multivariate  normality  are  robust 
to  departures  from  normality. 


4.2  PROPERTIES  OF  MULTIVARIATE  NORMAL 
RANDOM  VARIABLES 

We list some of the properties of a random $p \times 1$ vector $\mathbf{y}$ from a multivariate normal
distribution $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$:




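1. Normality of linear combinations of the variables in $\mathbf{y}$:

(a) If $\mathbf{a}$ is a vector of constants, the linear function $\mathbf{a}'\mathbf{y} = a_1y_1 + a_2y_2 + \cdots + a_py_p$
is univariate normal:

If $\mathbf{y}$ is $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, then $\mathbf{a}'\mathbf{y}$ is $N(\mathbf{a}'\boldsymbol{\mu}, \mathbf{a}'\boldsymbol{\Sigma}\mathbf{a})$.

The mean and variance of $\mathbf{a}'\mathbf{y}$ were given in (3.69) and (3.70) as $E(\mathbf{a}'\mathbf{y}) = \mathbf{a}'\boldsymbol{\mu}$
and $\mathrm{var}(\mathbf{a}'\mathbf{y}) = \mathbf{a}'\boldsymbol{\Sigma}\mathbf{a}$ for any random vector $\mathbf{y}$. We now have the
additional attribute that $\mathbf{a}'\mathbf{y}$ has a (univariate) normal distribution if $\mathbf{y}$ is
$N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$.

(b) If $\mathbf{A}$ is a constant $q \times p$ matrix of rank $q$, where $q \le p$, the $q$ linear
combinations in $\mathbf{A}\mathbf{y}$ have a multivariate normal distribution:

If $\mathbf{y}$ is $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, then $\mathbf{A}\mathbf{y}$ is $N_q(\mathbf{A}\boldsymbol{\mu}, \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}')$.

Here, again, $E(\mathbf{A}\mathbf{y}) = \mathbf{A}\boldsymbol{\mu}$ and $\mathrm{cov}(\mathbf{A}\mathbf{y}) = \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}'$, in general, as given
in (3.73) and (3.74). But we now have the additional feature that the $q$
variables in $\mathbf{A}\mathbf{y}$ have a multivariate normal distribution.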

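2. Standardized variables:

A standardized vector $\mathbf{z}$ can be obtained in two ways:

$$\mathbf{z} = (\mathbf{T}')^{-1}(\mathbf{y} - \boldsymbol{\mu}), \tag{4.4}$$

where $\boldsymbol{\Sigma} = \mathbf{T}'\mathbf{T}$ is factored using the Cholesky procedure in Section 2.7, or

$$\mathbf{z} = (\boldsymbol{\Sigma}^{1/2})^{-1}(\mathbf{y} - \boldsymbol{\mu}), \tag{4.5}$$

where $\boldsymbol{\Sigma}^{1/2}$ is the symmetric square root matrix of $\boldsymbol{\Sigma}$ defined in (2.112) such
that $\boldsymbol{\Sigma} = \boldsymbol{\Sigma}^{1/2}\boldsymbol{\Sigma}^{1/2}$. In either (4.4) or (4.5), the standardized vector of random
variables has all means equal to 0, all variances equal to 1, and all correlations
equal to 0. In either case, it follows from property 1b that $\mathbf{z}$ is multivariate
normal:

If $\mathbf{y}$ is $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, then $\mathbf{z}$ is $N_p(\mathbf{0}, \mathbf{I})$.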

3. Chi-square distribution:

A chi-square random variable with $p$ degrees of freedom is defined as the
sum of squares of $p$ independent standard normal random variables. Thus, if
$\mathbf{z}$ is the standardized vector defined in (4.4) or (4.5), then $\sum_{j=1}^{p} z_j^2 = \mathbf{z}'\mathbf{z}$ has
the $\chi^2$-distribution with $p$ degrees of freedom, denoted as $\chi^2_p$ or $\chi^2(p)$. From
either (4.4) or (4.5) we obtain $\mathbf{z}'\mathbf{z} = (\mathbf{y} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{y} - \boldsymbol{\mu})$. Hence,

$$\text{If } \mathbf{y} \text{ is } N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}), \text{ then } (\mathbf{y} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{y} - \boldsymbol{\mu}) \text{ is } \chi^2_p. \tag{4.6}$$

4. Normality of marginal distributions:

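(a) Any subset of the $y$'s in $\mathbf{y}$ has a multivariate normal distribution, with
mean vector consisting of the corresponding subvector of $\boldsymbol{\mu}$ and covariance
matrix composed of the corresponding submatrix of $\boldsymbol{\Sigma}$. To illustrate,
let $\mathbf{y}_1 = (y_1, y_2, \ldots, y_r)'$ denote the subvector containing the first $r$ elements
of $\mathbf{y}$ and $\mathbf{y}_2 = (y_{r+1}, \ldots, y_p)'$ consist of the remaining $p - r$
elements. Thus $\mathbf{y}$, $\boldsymbol{\mu}$, and $\boldsymbol{\Sigma}$ are partitioned as

$$\mathbf{y} = \begin{pmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \end{pmatrix}, \qquad \boldsymbol{\mu} = \begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{pmatrix}, \qquad \boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{pmatrix},$$

where $\mathbf{y}_1$ and $\boldsymbol{\mu}_1$ are $r \times 1$ and $\boldsymbol{\Sigma}_{11}$ is $r \times r$. Then $\mathbf{y}_1$ is multivariate normal:

If $\mathbf{y}$ is $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, then $\mathbf{y}_1$ is $N_r(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_{11})$.

Here, again, $E(\mathbf{y}_1) = \boldsymbol{\mu}_1$ and $\mathrm{cov}(\mathbf{y}_1) = \boldsymbol{\Sigma}_{11}$ hold for any random vector
partitioned in this way. But if $\mathbf{y}$ is $p$-variate normal, then $\mathbf{y}_1$ is $r$-variate
normal.

(b) As a special case of the preceding result, each $y_j$ in $\mathbf{y}$ has the univariate
normal distribution:

If $\mathbf{y}$ is $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, then $y_j$ is $N(\mu_j, \sigma_{jj})$, $j = 1, 2, \ldots, p$.

The converse of this is not true. If the density of each $y_j$ in $\mathbf{y}$ is normal, it
does not necessarily follow that $\mathbf{y}$ is multivariate normal.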
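
Properties 2 and 3 can be illustrated numerically. The following is a minimal sketch, assuming a small positive definite covariance matrix supplied as nested lists; it factors $\boldsymbol{\Sigma} = \mathbf{T}'\mathbf{T}$ by a hand-coded Cholesky routine (written here as $\boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}'$ with $\mathbf{L} = \mathbf{T}'$), forms $\mathbf{z}$ as in (4.4), and confirms that $\mathbf{z}'\mathbf{z}$ reproduces the quadratic form in (4.6). The function names and example values are hypothetical.

```python
import math

def cholesky_lower(sigma):
    """Return lower-triangular L with sigma = L L' (so T = L' in the notation of (4.4))."""
    p = len(sigma)
    L = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(sigma[i][i] - s)
            else:
                L[i][j] = (sigma[i][j] - s) / L[j][j]
    return L

def standardize(y, mu, sigma):
    """z = (T')^{-1} (y - mu), computed by forward substitution on L z = y - mu."""
    L = cholesky_lower(sigma)
    z = []
    for i in range(len(y)):
        s = sum(L[i][k] * z[k] for k in range(i))
        z.append((y[i] - mu[i] - s) / L[i][i])
    return z

# Hypothetical example: Sigma = [[4, 2], [2, 3]], mu = (0, 0)', y = (2, 1)'
z = standardize([2.0, 1.0], [0.0, 0.0], [[4.0, 2.0], [2.0, 3.0]])
z_sq = sum(v * v for v in z)  # equals (y - mu)' Sigma^{-1} (y - mu)
```

For this example the quadratic form $(\mathbf{y} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{y} - \boldsymbol{\mu})$ works out by hand to 1, which is the value `z_sq` should reproduce.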

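In the next three properties, let the observation vector be partitioned into two
subvectors denoted by $\mathbf{y}$ and $\mathbf{x}$, where $\mathbf{y}$ is $p \times 1$ and $\mathbf{x}$ is $q \times 1$. Or, alternatively, let
$\mathbf{x}$ represent some additional variables to be considered along with those in $\mathbf{y}$. Then,
as in (3.45) and (3.46),

$$E\begin{pmatrix} \mathbf{y} \\ \mathbf{x} \end{pmatrix} = \begin{pmatrix} \boldsymbol{\mu}_y \\ \boldsymbol{\mu}_x \end{pmatrix}, \qquad \mathrm{cov}\begin{pmatrix} \mathbf{y} \\ \mathbf{x} \end{pmatrix} = \begin{pmatrix} \boldsymbol{\Sigma}_{yy} & \boldsymbol{\Sigma}_{yx} \\ \boldsymbol{\Sigma}_{xy} & \boldsymbol{\Sigma}_{xx} \end{pmatrix}.$$

In properties 5, 6, and 7, we assume that

$$\begin{pmatrix} \mathbf{y} \\ \mathbf{x} \end{pmatrix} \text{ is } N_{p+q}\left[\begin{pmatrix} \boldsymbol{\mu}_y \\ \boldsymbol{\mu}_x \end{pmatrix}, \begin{pmatrix} \boldsymbol{\Sigma}_{yy} & \boldsymbol{\Sigma}_{yx} \\ \boldsymbol{\Sigma}_{xy} & \boldsymbol{\Sigma}_{xx} \end{pmatrix}\right].$$

5. Independence:

(a) The subvectors $\mathbf{y}$ and $\mathbf{x}$ are independent if $\boldsymbol{\Sigma}_{yx} = \mathbf{O}$.

(b) Two individual variables $y_j$ and $y_k$ are independent if $\sigma_{jk} = 0$. Note that
this is not true for many nonnormal random variables, as illustrated in
Section 3.2.1.

6. Conditional distribution: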

If $\mathbf{y}$ and $\mathbf{x}$ are not independent, then $\boldsymbol{\Sigma}_{yx} \neq \mathbf{O}$, and the conditional distribution
of $\mathbf{y}$ given $\mathbf{x}$, $f(\mathbf{y}|\mathbf{x})$, is multivariate normal with

$$E(\mathbf{y}|\mathbf{x}) = \boldsymbol{\mu}_y + \boldsymbol{\Sigma}_{yx}\boldsymbol{\Sigma}_{xx}^{-1}(\mathbf{x} - \boldsymbol{\mu}_x), \tag{4.7}$$

$$\mathrm{cov}(\mathbf{y}|\mathbf{x}) = \boldsymbol{\Sigma}_{yy} - \boldsymbol{\Sigma}_{yx}\boldsymbol{\Sigma}_{xx}^{-1}\boldsymbol{\Sigma}_{xy}. \tag{4.8}$$

Note that $E(\mathbf{y}|\mathbf{x})$ is a vector of linear functions of $\mathbf{x}$, whereas $\mathrm{cov}(\mathbf{y}|\mathbf{x})$ is a
matrix that does not depend on $\mathbf{x}$. The linear trend in (4.7) holds for any pair of
variables. Thus to use (4.7) as a check on normality, one can examine bivariate
scatter plots of all pairs of variables and look for any nonlinear trends. In (4.7),
we have the justification for using the covariance or correlation to measure
the relationship between two bivariate normal random variables. As noted in
Section 3.2.1, the covariance and correlation are good measures of relationship
only for variables with linear trends and are generally unsuitable for nonnormal
random variables with a curvilinear relationship. The matrix $\boldsymbol{\Sigma}_{yx}\boldsymbol{\Sigma}_{xx}^{-1}$ in (4.7)
is called the matrix of regression coefficients because it relates $E(\mathbf{y}|\mathbf{x})$ to $\mathbf{x}$.
The sample counterpart of this matrix appears in (10.52).

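7. Distribution of the sum of two subvectors:

If $\mathbf{y}$ and $\mathbf{x}$ are the same size (both $p \times 1$) and independent, then

$$\mathbf{y} + \mathbf{x} \text{ is } N_p(\boldsymbol{\mu}_y + \boldsymbol{\mu}_x, \boldsymbol{\Sigma}_{yy} + \boldsymbol{\Sigma}_{xx}), \tag{4.9}$$

$$\mathbf{y} - \mathbf{x} \text{ is } N_p(\boldsymbol{\mu}_y - \boldsymbol{\mu}_x, \boldsymbol{\Sigma}_{yy} + \boldsymbol{\Sigma}_{xx}). \tag{4.10}$$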

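In the remainder of this section, we illustrate property 6 for the special case of the
bivariate normal. Let

$$\mathbf{u} = \begin{pmatrix} y \\ x \end{pmatrix}$$

have a bivariate normal distribution with

$$E(\mathbf{u}) = \begin{pmatrix} \mu_y \\ \mu_x \end{pmatrix}, \qquad \mathrm{cov}(\mathbf{u}) = \boldsymbol{\Sigma} = \begin{pmatrix} \sigma_y^2 & \sigma_{yx} \\ \sigma_{yx} & \sigma_x^2 \end{pmatrix}.$$

By definition $f(y|x) = g(y, x)/h(x)$, where $h(x)$ is the density of $x$ and $g(y, x)$ is
the joint density of $y$ and $x$. Hence

$$g(y, x) = f(y|x)\,h(x),$$

and because the right side is a product, we seek a function of $y$ and $x$ that is independent
of $x$ and whose density can serve as $f(y|x)$. Since linear functions of $y$ and
$x$ are normal by property 1a, we consider $y - \beta x$ and seek the value of $\beta$ so that
$y - \beta x$ and $x$ are independent.

Since $z = y - \beta x$ and $x$ are normal and independent, $\mathrm{cov}(x, z) = 0$. To find
$\mathrm{cov}(x, z)$, we express $x$ and $z$ as functions of $\mathbf{u}$,

$$x = (0, 1)\begin{pmatrix} y \\ x \end{pmatrix} = (0, 1)\mathbf{u} = \mathbf{a}'\mathbf{u},$$

$$z = y - \beta x = (1, -\beta)\mathbf{u} = \mathbf{b}'\mathbf{u}.$$

Now

$$\mathrm{cov}(x, z) = \mathrm{cov}(\mathbf{a}'\mathbf{u}, \mathbf{b}'\mathbf{u}) = \mathbf{a}'\boldsymbol{\Sigma}\mathbf{b} \quad [\text{by (3.71)}]$$

$$= (0, 1)\begin{pmatrix} \sigma_y^2 & \sigma_{yx} \\ \sigma_{yx} & \sigma_x^2 \end{pmatrix}\begin{pmatrix} 1 \\ -\beta \end{pmatrix} = (\sigma_{yx}, \sigma_x^2)\begin{pmatrix} 1 \\ -\beta \end{pmatrix} = \sigma_{yx} - \beta\sigma_x^2.$$

Since $\mathrm{cov}(x, z) = 0$, we obtain $\beta = \sigma_{yx}/\sigma_x^2$, and $z = y - \beta x$ becomes

$$z = y - \frac{\sigma_{yx}}{\sigma_x^2}\,x.$$

By property 1a, the density of $y - (\sigma_{yx}/\sigma_x^2)x$ is normal with

$$E\left(y - \frac{\sigma_{yx}}{\sigma_x^2}\,x\right) = \mu_y - \frac{\sigma_{yx}}{\sigma_x^2}\,\mu_x,$$

$$\mathrm{var}\left(y - \frac{\sigma_{yx}}{\sigma_x^2}\,x\right) = \mathrm{var}(\mathbf{b}'\mathbf{u}) = \mathbf{b}'\boldsymbol{\Sigma}\mathbf{b}
= \left(1, -\frac{\sigma_{yx}}{\sigma_x^2}\right)\begin{pmatrix} \sigma_y^2 & \sigma_{yx} \\ \sigma_{yx} & \sigma_x^2 \end{pmatrix}\begin{pmatrix} 1 \\ -\dfrac{\sigma_{yx}}{\sigma_x^2} \end{pmatrix}
= \sigma_y^2 - \frac{\sigma_{yx}^2}{\sigma_x^2}.$$

For a given value of $x$, we can express $y$ as $y = \beta x + (y - \beta x)$, where $\beta x$
is a fixed quantity corresponding to the given value of $x$ and $y - \beta x$ is a random
deviation. Then $f(y|x)$ is normal, with

$$E(y|x) = \beta x + E(y - \beta x) = \beta x + \mu_y - \beta\mu_x = \mu_y + \beta(x - \mu_x) = \mu_y + \frac{\sigma_{yx}}{\sigma_x^2}(x - \mu_x),$$

$$\mathrm{var}(y|x) = \sigma_y^2 - \frac{\sigma_{yx}^2}{\sigma_x^2}.$$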
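
The bivariate result can be evaluated directly. The sketch below, with hypothetical function and argument names and made-up parameter values, computes $\beta = \sigma_{yx}/\sigma_x^2$, the conditional mean, and the conditional variance, and also checks that $\mathrm{cov}(x, y - \beta x) = \sigma_{yx} - \beta\sigma_x^2$ vanishes for this choice of $\beta$.

```python
def conditional_normal(x, mu_y, mu_x, var_y, var_x, cov_yx):
    """Mean and variance of y given x for a bivariate normal (special case of (4.7), (4.8))."""
    beta = cov_yx / var_x                     # regression coefficient
    mean = mu_y + beta * (x - mu_x)           # linear in x
    var = var_y - cov_yx**2 / var_x           # does not depend on x
    return mean, var

def cov_x_residual(var_x, cov_yx, beta):
    """cov(x, y - beta*x) = sigma_yx - beta * sigma_x^2."""
    return cov_yx - beta * var_x

# Hypothetical parameters: mu_y = 1, mu_x = 2, var_y = 4, var_x = 1, cov_yx = 1.5
mean, var = conditional_normal(3.0, 1.0, 2.0, 4.0, 1.0, 1.5)
```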
4.3 ESTIMATION IN THE MULTIVARIATE NORMAL

4.3.1 Maximum Likelihood Estimation

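When a distribution such as the multivariate normal is assumed to hold for a population,
estimates of the parameters are often found by the method of maximum likelihood.
This technique is conceptually simple: The observation vectors $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n$
are considered to be known and values of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are sought that maximize the
joint density of the $\mathbf{y}$'s, called the likelihood function. For the multivariate normal,
the maximum likelihood estimates of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are

$$\hat{\boldsymbol{\mu}} = \bar{\mathbf{y}}, \tag{4.11}$$

$$\hat{\boldsymbol{\Sigma}} = \frac{1}{n}\sum_{i=1}^{n}(\mathbf{y}_i - \bar{\mathbf{y}})(\mathbf{y}_i - \bar{\mathbf{y}})' = \frac{1}{n}\mathbf{W} = \frac{n-1}{n}\mathbf{S}, \tag{4.12}$$

where $\mathbf{W} = \sum_{i=1}^{n}(\mathbf{y}_i - \bar{\mathbf{y}})(\mathbf{y}_i - \bar{\mathbf{y}})'$ and $\mathbf{S}$ is the sample covariance matrix defined
in (3.22) and (3.27). Since $\hat{\boldsymbol{\Sigma}}$ has divisor $n$ instead of $n - 1$, it is biased [see (3.33)],
and we usually use $\mathbf{S}$ in place of $\hat{\boldsymbol{\Sigma}}$.

We now give a justification of $\bar{\mathbf{y}}$ as the maximum likelihood estimator of $\boldsymbol{\mu}$.
Because the $\mathbf{y}_i$'s constitute a random sample, they are independent, and the joint
density is the product of the densities of the $\mathbf{y}$'s. The likelihood function is, therefore,

$$L(\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \prod_{i=1}^{n} f(\mathbf{y}_i, \boldsymbol{\mu}, \boldsymbol{\Sigma})
= \prod_{i=1}^{n} \frac{1}{(\sqrt{2\pi})^{p}|\boldsymbol{\Sigma}|^{1/2}}\, e^{-(\mathbf{y}_i - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{y}_i - \boldsymbol{\mu})/2}$$

$$= \frac{1}{(\sqrt{2\pi})^{np}|\boldsymbol{\Sigma}|^{n/2}}\, e^{-\sum_{i=1}^{n}(\mathbf{y}_i - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{y}_i - \boldsymbol{\mu})/2}. \tag{4.13}$$

To see that $\hat{\boldsymbol{\mu}} = \bar{\mathbf{y}}$ maximizes the likelihood function, we begin by adding and subtracting
$\bar{\mathbf{y}}$ in the exponent in (4.13),

$$-\frac{1}{2}\sum_{i=1}^{n}(\mathbf{y}_i - \bar{\mathbf{y}} + \bar{\mathbf{y}} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{y}_i - \bar{\mathbf{y}} + \bar{\mathbf{y}} - \boldsymbol{\mu}).$$

When this is expanded in terms of $\mathbf{y}_i - \bar{\mathbf{y}}$ and $\bar{\mathbf{y}} - \boldsymbol{\mu}$, two of the four resulting terms
vanish because $\sum_i(\mathbf{y}_i - \bar{\mathbf{y}}) = \mathbf{0}$, and (4.13) becomes

$$L = \frac{1}{(\sqrt{2\pi})^{np}|\boldsymbol{\Sigma}|^{n/2}}\, e^{-\sum_{i=1}^{n}(\mathbf{y}_i - \bar{\mathbf{y}})'\boldsymbol{\Sigma}^{-1}(\mathbf{y}_i - \bar{\mathbf{y}})/2 \,-\, n(\bar{\mathbf{y}} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\bar{\mathbf{y}} - \boldsymbol{\mu})/2}. \tag{4.14}$$

Since $\boldsymbol{\Sigma}^{-1}$ is positive definite, we have $-n(\bar{\mathbf{y}} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\bar{\mathbf{y}} - \boldsymbol{\mu})/2 \le 0$ and
$0 < e^{-n(\bar{\mathbf{y}} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\bar{\mathbf{y}} - \boldsymbol{\mu})/2} \le 1$, with the maximum occurring when the exponent is 0.
Therefore, $L$ is maximized when $\hat{\boldsymbol{\mu}} = \bar{\mathbf{y}}$.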

The maximum likelihood estimator of the population correlation matrix $\mathbf{P}_\rho$ [see
(3.39)] is the sample correlation matrix, that is,

$$\hat{\mathbf{P}}_\rho = \mathbf{R}.$$

Relationships  among  multinormal  variables  are  linear,  as  can  be  seen  in  (4.7). 
Thus  the  estimators  S  and  R  serve  well  for  the  multivariate  normal  because  they 
measure  only  linear  relationships  (see  Sections  3.2.1  and  4.2).  These  estimators  are 
not  as  useful  for  some  nonnormal  distributions. 
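
To make the distinction between the maximum likelihood estimate (divisor $n$) and the sample covariance matrix $\mathbf{S}$ (divisor $n - 1$) concrete, here is a minimal sketch in plain Python. The function names and the three-observation data set are hypothetical and serve only as an illustration.

```python
def mean_vector(Y):
    """Sample mean vector ybar for data rows Y (n rows, p columns)."""
    n, p = len(Y), len(Y[0])
    return [sum(row[j] for row in Y) / n for j in range(p)]

def sscp(Y):
    """W = sum over i of (y_i - ybar)(y_i - ybar)'."""
    ybar = mean_vector(Y)
    p = len(ybar)
    W = [[0.0] * p for _ in range(p)]
    for row in Y:
        d = [row[j] - ybar[j] for j in range(p)]
        for j in range(p):
            for k in range(p):
                W[j][k] += d[j] * d[k]
    return W

def covariance_estimates(Y):
    """Return (ybar, Sigma_hat = W/n, S = W/(n-1))."""
    n = len(Y)
    W = sscp(Y)
    sigma_hat = [[w / n for w in r] for r in W]
    S = [[w / (n - 1) for w in r] for r in W]
    return mean_vector(Y), sigma_hat, S

# Hypothetical data: n = 3 observations on p = 2 variables
ybar, sigma_hat, S = covariance_estimates([[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]])
```

Each entry of `sigma_hat` equals the corresponding entry of `S` multiplied by $(n-1)/n$, so the two estimates differ noticeably only for small $n$.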


4.3.2  Distribution  of  y  and  S 

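For the distribution of $\bar{\mathbf{y}} = \sum_{i=1}^{n}\mathbf{y}_i/n$, we can distinguish two cases:

1. When $\bar{\mathbf{y}}$ is based on a random sample $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n$ from a multivariate normal
distribution $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, then $\bar{\mathbf{y}}$ is $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}/n)$.

2. When $\bar{\mathbf{y}}$ is based on a random sample $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n$ from a nonnormal multivariate
population with mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$, then for large
$n$, $\bar{\mathbf{y}}$ is approximately $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}/n)$. More formally, this result is known as the
multivariate central limit theorem: If $\bar{\mathbf{y}}$ is the mean vector of a random sample
$\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n$ from a population with mean vector $\boldsymbol{\mu}$ and covariance matrix
$\boldsymbol{\Sigma}$, then as $n \to \infty$, the distribution of $\sqrt{n}(\bar{\mathbf{y}} - \boldsymbol{\mu})$ approaches $N_p(\mathbf{0}, \boldsymbol{\Sigma})$.

There are $p$ variances in $\mathbf{S}$ and $\binom{p}{2}$ covariances, for a total of

$$p + \binom{p}{2} = p + \frac{p(p-1)}{2} = \frac{p(p+1)}{2}$$

distinct entries. The joint distribution of these $p(p+1)/2$ distinct variables in
$\mathbf{W} = (n-1)\mathbf{S} = \sum_i(\mathbf{y}_i - \bar{\mathbf{y}})(\mathbf{y}_i - \bar{\mathbf{y}})'$ is the Wishart distribution, denoted by $W_p(n-1, \boldsymbol{\Sigma})$,
where $n - 1$ is the degrees of freedom.

The Wishart distribution is the multivariate analogue of the $\chi^2$-distribution, and
it has similar uses. As noted in property 3 of Section 4.2, a $\chi^2$ random variable is
defined formally as the sum of squares of independent standard normal (univariate)
random variables:

$$\sum_{i=1}^{n} z_i^2 = \sum_{i=1}^{n}\frac{(y_i - \mu)^2}{\sigma^2} \text{ is } \chi^2(n).$$

If $\bar{y}$ is substituted for $\mu$, then $\sum_i(y_i - \bar{y})^2/\sigma^2 = (n-1)s^2/\sigma^2$ is $\chi^2(n-1)$. Similarly,
the formal definition of a Wishart random variable is

$$\sum_{i=1}^{n}(\mathbf{y}_i - \boldsymbol{\mu})(\mathbf{y}_i - \boldsymbol{\mu})' \text{ is } W_p(n, \boldsymbol{\Sigma}), \tag{4.15}$$

where $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n$ are independently distributed as $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. When $\bar{\mathbf{y}}$ is substituted
for $\boldsymbol{\mu}$, the distribution remains Wishart with one less degree of freedom:

$$(n-1)\mathbf{S} = \sum_{i=1}^{n}(\mathbf{y}_i - \bar{\mathbf{y}})(\mathbf{y}_i - \bar{\mathbf{y}})' \text{ is } W_p(n-1, \boldsymbol{\Sigma}). \tag{4.16}$$

Finally, we note that when sampling from a multivariate normal distribution, $\bar{\mathbf{y}}$
and $\mathbf{S}$ are independent.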


4.4  ASSESSING  MULTIVARIATE  NORMALITY 

Many  tests  and  graphical  procedures  have  been  suggested  for  evaluating  whether 
a  data  set  likely  originated  from  a  multivariate  normal  population.  One  possibility 
is  to  check  each  variable  separately  for  univariate  normality.  Excellent  reviews  for 
both  the  univariate  and  multivariate  cases  have  been  given  by  Gnanadesikan  (1997, 
pp.  178-220)  and  Seber  (1984,  pp.  141-155).  We  give  a  representative  sample  of 
univariate  and  multivariate  methods  in  Sections  4.4.1  and  4.4.2,  respectively. 

4.4.1  Investigating  Univariate  Normality 

When  we  have  several  variables,  checking  each  for  univariate  normality  should  not 
be  the  sole  approach,  because  (1)  the  variables  are  correlated  and  (2)  normality  of 
the individual variables does not guarantee joint normality. On the other hand, multivariate
normality implies individual normality. Hence, if even one of the separate
variables  is  not  normal,  the  vector  is  not  multivariate  normal.  An  initial  check  on  the 
individual  variables  may  therefore  be  useful. 

A  basic  graphical  approach  for  checking  normality  is  the  Q-Q  plot  comparing 
quantiles  of  a  sample  against  the  population  quantiles  of  the  univariate  normal.  If  the 
points  are  close  to  a  straight  line,  there  is  no  indication  of  departure  from  normality. 
Deviation  from  a  straight  line  indicates  nonnormality  (at  least  for  a  large  sample).  In 
fact,  the  type  of  nonlinear  pattern  may  reveal  the  type  of  departure  from  normality. 
Some  possibilities  are  illustrated  in  Figure  4.5. 

Quantiles  are  similar  to  the  more  familiar  percentiles,  which  are  expressed  in 
terms  of  percent;  a  test  score  at  the  90th  percentile,  for  example,  is  above  90%  of 
the  test  scores  and  below  10%  of  them.  Quantiles  are  expressed  in  terms  of  fractions 
or  proportions.  Thus  the  90th  percentile  score  becomes  the  .9  quantile  score. 

The sample quantiles for the Q-Q plot are obtained as follows. First we rank the
observations $y_1, y_2, \ldots, y_n$ and denote the ordered values by $y_{(1)}, y_{(2)}, \ldots, y_{(n)}$;
thus $y_{(1)} \le y_{(2)} \le \cdots \le y_{(n)}$. Then the point $y_{(i)}$ is the $i/n$ sample quantile. For
example, if $n = 20$, $y_{(7)}$ is the $\frac{7}{20} = .35$ quantile, because .35 of the sample is less
than or equal to $y_{(7)}$. The fraction $i/n$ is often changed to $(i - \frac{1}{2})/n$ as a continuity
correction. If $n = 20$, $(i - \frac{1}{2})/n$ ranges from .025 to .975 and more evenly covers the
interval from 0 to 1. With this convention, $y_{(i)}$ is designated as the $(i - \frac{1}{2})/n$ sample
quantile.

Figure 4.5. Typical Q-Q plots for nonnormal data.

The population quantiles for the Q-Q plot are similarly defined corresponding to
$(i - \frac{1}{2})/n$. If we denote these by $q_1, q_2, \ldots, q_n$, then $q_i$ is the value below which a
proportion $(i - \frac{1}{2})/n$ of the observations in the population lie; that is, $(i - \frac{1}{2})/n$ is
the probability of getting an observation less than or equal to $q_i$. Formally, $q_i$ can be
found for the standard normal random variable $y$ with distribution $N(0, 1)$ by solving

$$\Phi(q_i) = P(y < q_i) = \frac{i - \frac{1}{2}}{n}, \tag{4.17}$$

which would require numerical integration or tables of the cumulative standard normal
distribution, $\Phi(x)$. Another benefit of using $(i - \frac{1}{2})/n$ instead of $i/n$ is that
$n/n = 1$ would make $q_n = \infty$.

The population need not have the same mean and variance as the sample, since
changes in mean and variance merely change the slope and intercept of the plotted
line in the Q-Q plot. Therefore, we use the standard normal distribution, and the $q_i$
values can easily be found from a table of cumulative standard normal probabilities.
We then plot the pairs $(q_i, y_{(i)})$ and examine the resulting Q-Q plot for linearity.

Special graph paper, called normal probability paper, is available that eliminates
the need to look up the $q_i$ values. We need only plot $(i - \frac{1}{2})/n$ in place of $q_i$, that
is, plot the pairs $[(i - \frac{1}{2})/n, y_{(i)}]$ and look for linearity as before. As an even easier
alternative, most general-purpose statistical software programs provide normal
probability plots of the pairs $(q_i, y_{(i)})$.
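
The construction of the pairs $(q_i, y_{(i)})$ can be sketched in a few lines. This example assumes Python 3.8 or later, where `statistics.NormalDist.inv_cdf` supplies $\Phi^{-1}$; the function name `qq_pairs` and the sample values are hypothetical.

```python
from statistics import NormalDist

def qq_pairs(y):
    """Return the pairs (q_i, y_(i)) with Phi(q_i) = (i - 1/2)/n, as in (4.17)."""
    n = len(y)
    ordered = sorted(y)                      # y_(1) <= y_(2) <= ... <= y_(n)
    std_normal = NormalDist()                # N(0, 1)
    pairs = []
    for i, y_i in enumerate(ordered, start=1):
        prob = (i - 0.5) / n                 # continuity-corrected fraction
        pairs.append((std_normal.inv_cdf(prob), y_i))
    return pairs

# Hypothetical sample of n = 20 values
pairs = qq_pairs(list(range(20, 0, -1)))
```

Plotting the second coordinate against the first (with any plotting tool) gives the Q-Q plot; an approximately straight line gives no indication of departure from normality.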

The  Q-Q  plots  provide  a  good  visual  check  on  normality  and  are  considered 
to  be  adequate  for  this  purpose  by  many  researchers.  For  those  who  desire  a  more 
objective  procedure,  several  hypothesis  tests  are  available.  We  give  three  of  these  that 
have  good  properties  and  are  computationally  tractable. 

We discuss first a classical approach based on the following measures of skewness
and kurtosis:

$$\sqrt{b_1} = \frac{\sqrt{n}\sum_{i=1}^{n}(y_i - \bar{y})^3}{\left[\sum_{i=1}^{n}(y_i - \bar{y})^2\right]^{3/2}}, \tag{4.18}$$

$$b_2 = \frac{n\sum_{i=1}^{n}(y_i - \bar{y})^4}{\left[\sum_{i=1}^{n}(y_i - \bar{y})^2\right]^{2}}. \tag{4.19}$$

These are sample estimates of the population skewness and kurtosis parameters $\sqrt{\beta_1}$
and $\beta_2$, respectively. When the population is normal, $\sqrt{\beta_1} = 0$ and $\beta_2 = 3$. If
$\sqrt{\beta_1} < 0$, we have negative skewness; if $\sqrt{\beta_1} > 0$, the skewness is positive. Positive
skewness is illustrated in Figure 4.6. If $\beta_2 < 3$, we have negative kurtosis, and if $\beta_2 > 3$,
there is positive kurtosis. A distribution with negative kurtosis is characterized by
being flatter than the normal distribution, that is, less peaked, with heavier flanks
and thinner tails. A distribution with positive kurtosis has a higher peak than the
normal, with an excess of values near the mean and in the tails but with thinner
flanks. Positive and negative kurtosis are illustrated in Figure 4.7.
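
The two statistics in (4.18) and (4.19) are straightforward to compute. The following sketch uses a hypothetical function name and two small made-up samples, one symmetric and one skewed to the right.

```python
import math

def skewness_kurtosis(y):
    """Return (sqrt_b1, b2) as defined in (4.18) and (4.19)."""
    n = len(y)
    ybar = sum(y) / n
    m2 = sum((v - ybar) ** 2 for v in y)     # sum of squared deviations
    m3 = sum((v - ybar) ** 3 for v in y)     # sum of cubed deviations
    m4 = sum((v - ybar) ** 4 for v in y)     # sum of fourth powers
    sqrt_b1 = math.sqrt(n) * m3 / m2 ** 1.5
    b2 = n * m4 / m2 ** 2
    return sqrt_b1, b2

symmetric = skewness_kurtosis([1, 2, 3, 4, 5])   # symmetric sample: sqrt_b1 is 0
right_skewed = skewness_kurtosis([1, 1, 1, 5])   # long right tail: sqrt_b1 is positive
```

The resulting values would then be compared with the tabled percentage points described next, not judged on their own.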

The test of normality can be carried out using the exact percentage points for $\sqrt{b_1}$
in Table A.1 for $4 \le n \le 25$, as given by Mulholland (1977). Alternatively, for $n \ge 8$
the function $g$, as defined by

$$g(\sqrt{b_1}) = \delta \sinh^{-1}\left(\frac{\sqrt{b_1}}{\lambda}\right), \tag{4.20}$$

is approximately $N(0, 1)$, where

$$\sinh^{-1}(x) = \ln\left(x + \sqrt{x^2 + 1}\right). \tag{4.21}$$

Table A.2, from D'Agostino and Pearson (1973), gives values for $\delta$ and $1/\lambda$. To use
$b_2$ as a test of normality, we can use Table A.3, from D'Agostino and Tietjen (1971),
which gives simulated percentiles of $b_2$ for selected values of $n$ in the range $7 \le n \le 50$.
Charts of percentiles of $b_2$ for $20 \le n \le 200$ can be found in D'Agostino and
Pearson (1973).

Figure 4.7. Distributions with positive and negative kurtosis compared to the normal.

Our second test for normality was given by D'Agostino (1971). The observations
$y_1, y_2, \ldots, y_n$ are ordered as $y_{(1)} \le y_{(2)} \le \cdots \le y_{(n)}$, and we calculate

$$D = \frac{\sum_{i=1}^{n}\left[i - \frac{1}{2}(n+1)\right]y_{(i)}}{\sqrt{n^3\sum_{i=1}^{n}(y_i - \bar{y})^2}}, \tag{4.22}$$

$$Y = \frac{\sqrt{n}\left[D - (2\sqrt{\pi})^{-1}\right]}{.02998598}. \tag{4.23}$$

A table of percentiles for $Y$, given by D'Agostino (1972) for $10 \le n \le 250$, is
provided in Table A.4.
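
As an illustration of (4.22) and (4.23), the sketch below computes $D$ and $Y$ for a sample. It assumes the usual form of D'Agostino's statistic, with denominator $\sqrt{n^3\sum_i(y_i - \bar{y})^2}$; the function name and the example data are hypothetical.

```python
import math

def dagostino_d(y):
    """Return (D, Y) for D'Agostino's test, following (4.22) and (4.23)."""
    n = len(y)
    ordered = sorted(y)
    ybar = sum(y) / n
    # Numerator: sum of [i - (n + 1)/2] * y_(i) over the ordered sample
    numerator = sum((i - (n + 1) / 2) * v for i, v in enumerate(ordered, start=1))
    denominator = math.sqrt(n ** 3 * sum((v - ybar) ** 2 for v in y))
    D = numerator / denominator
    Y = math.sqrt(n) * (D - 1 / (2 * math.sqrt(math.pi))) / 0.02998598
    return D, Y

# Hypothetical sample: the integers 1, ..., 10
D, Y = dagostino_d(list(range(1, 11)))
```

Because $D$ is a ratio of a linear function of the ordered values to a multiple of the standard deviation, it is unchanged by shifting the data or rescaling them by a positive constant; only the comparison of $Y$ with Table A.4 gives the test itself.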

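The final test we report is by Lin and Mudholkar (1980). The test statistic is

$$z = \tanh^{-1}(r) = \frac{1}{2}\ln\left(\frac{1 + r}{1 - r}\right), \tag{4.24}$$

where $r$ is the sample correlation of the $n$ pairs $(y_i, x_i)$, $i = 1, 2, \ldots, n$, with $x_i$
defined as

$$x_i = \frac{1}{n}\left[\sum_{j \neq i} y_j^2 - \frac{\left(\sum_{j \neq i} y_j\right)^2}{n - 1}\right]^{1/3}. \tag{4.25}$$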

If the $y$'s are normal, $z$ is approximately $N(0, 3/n)$. A more accurate upper $100\alpha$
percentile is given by

$$z_\alpha = \sigma_n\left[u_\alpha + \tfrac{1}{24}\left(u_\alpha^3 - 3u_\alpha\right)\gamma_{2n}\right], \tag{4.26}$$

with

$$\sigma_n^2 = \frac{3}{n} - \frac{7.324}{n^2} + \frac{53.005}{n^3}, \qquad u_\alpha = \Phi^{-1}(\alpha), \qquad \gamma_{2n} = -\frac{11.70}{n} + \frac{55.06}{n^2},$$

where $\Phi$ is the distribution function of the $N(0, 1)$ distribution; that is, $\Phi(x)$ is the
probability of an observation less than or equal to $x$, as in (4.17). The inverse function
$\Phi^{-1}$ is essentially a quantile. For example, $u_{.05} = -1.645$ and $u_{.95} = 1.645$.

4.4.2  Investigating  Multivariate  Normality 

Checking for multivariate normality is conceptually not as straightforward as assessing
univariate normality, and consequently the state of the art is not as well
developed. The complexity of this issue can be illustrated in the context of a
goodness-of-fit test for normality. For a goodness-of-fit test in the univariate case,
the range covered by a sample $y_1, y_2, \ldots, y_n$ is divided into several intervals, and
we count how many $y$'s fall into each interval. These observed frequencies (counts)
are compared to the expected frequencies under the assumption that the sample
came from a normal distribution with the same mean and variance as the sample.
If the $n$ observations $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n$ are multivariate, however, the procedure is not
so simple. We now have a $p$-dimensional region that would have to be divided into
many more subregions than in the univariate case, and the expected frequencies for
these subregions would be less easily obtained. With so many subregions, relatively
few would contain observations.

Thus because of the inherent "sparseness" of multivariate data, a goodness-of-fit
test would be impractical. The points $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n$ are more distant from each other
in $p$-space than in any one of the $p$ individual dimensions. Unless $n$ is very large, a
multivariate sample may not provide a very complete picture of the distribution from
which it was taken.

As a consequence of the sparseness of the data in $p$-space, the tests for multivariate
normality may not be very powerful. However, some check on the distribution is
often desirable. Numerous procedures have been proposed for assessing multivariate
normality. We now discuss three of these.

The  first  procedure  is  based  on  the  standardized  distance  from  each  y,  to  y, 

Df  =  (y /  -  y)'S" : 1  (y,  -  y) ,  i  =  1 , 2, . . .  ,  n.  (4.27) 

Gnanadesikan and Kettenring (1972) showed that if the $\mathbf{y}_i$'s are multivariate normal,
then

$$u_i = \frac{n D_i^2}{(n-1)^2} \tag{4.28}$$

has a beta distribution, which is related to the F distribution. To obtain a Q-Q plot,
the values $u_1, u_2, \ldots, u_n$ are ranked to give $u_{(1)} \le u_{(2)} \le \cdots \le u_{(n)}$, and we plot
$(u_{(i)}, v_i)$, where the quantiles $v_i$ of the beta are given by the solution to

$$\int_0^{v_i} \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\, x^{a-1} (1-x)^{b-1}\, dx = \frac{i - \alpha}{n - \alpha - \beta + 1}, \tag{4.29}$$

where $a = \frac{1}{2}p$, $b = \frac{1}{2}(n - p - 1)$,

$$\alpha = \frac{p-2}{2p}, \tag{4.30}$$

$$\beta = \frac{n-p-3}{2(n-p-1)}. \tag{4.31}$$


A nonlinear pattern in the plot would indicate a departure from normality. The quantiles
of the beta as defined in (4.29) are easily obtained in many software packages.
A formal significance test is also available for $D_{(n)}^2 = \max_i D_i^2$. Table A.6 gives the
upper 5% and 1% critical values for $p = 2, 3, 4, 5$ from Barnett and Lewis (1978).
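The quantities entering the Q-Q plot can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive implementation: the function names (`solve`, `squared_distances`, `qq_inputs`) and the small data set `toy` are hypothetical, and only the standard library is used. The sketch returns the distances of (4.27), the ordered $u_{(i)}$ of (4.28), and the right-hand-side probabilities of (4.29); the beta quantiles $v_i$ themselves would still have to be obtained from an inverse beta distribution function with parameters $a$ and $b$ in a statistical package.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(a)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= factor * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def squared_distances(data):
    """D_i^2 of (4.27), with S computed using the divisor n - 1."""
    n, p = len(data), len(data[0])
    ybar = [sum(row[j] for row in data) / n for j in range(p)]
    dev = [[row[j] - ybar[j] for j in range(p)] for row in data]
    s = [[sum(d[j] * d[k] for d in dev) / (n - 1) for k in range(p)]
         for j in range(p)]
    # (y_i - ybar)' S^{-1} (y_i - ybar), using a linear solve instead of S^{-1}
    return [sum(di * xi for di, xi in zip(d, solve(s, d))) for d in dev]


def qq_inputs(data):
    """Return D_i^2, the ordered u_(i) of (4.28), and the probabilities in (4.29)."""
    n, p = len(data), len(data[0])
    d2 = squared_distances(data)
    u = sorted(n * d / (n - 1) ** 2 for d in d2)
    alpha = (p - 2) / (2 * p)                    # (4.30)
    beta = (n - p - 3) / (2 * (n - p - 1))       # (4.31)
    probs = [(i - alpha) / (n - alpha - beta + 1) for i in range(1, n + 1)]
    return d2, u, probs


# Hypothetical bivariate sample, used only to exercise the functions.
toy = [[1.0, 2.0], [2.0, 1.5], [3.0, 3.5], [4.0, 3.0], [5.0, 5.5], [6.0, 4.0]]
d2, u, probs = qq_inputs(toy)
print(max(d2), u, probs)
```

A linear solve is used in place of an explicit inverse of $\mathbf{S}$; for the small values of $p$ considered here either choice would do.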

Some writers have suggested that the distribution of $D_i^2$ in (4.27) can be adequately
approximated by a $\chi_p^2$, since $(\mathbf{y} - \boldsymbol{\mu})' \boldsymbol{\Sigma}^{-1} (\mathbf{y} - \boldsymbol{\mu})$ is $\chi_p^2$ [see (4.6)]. However,
in Section 5.3.2, it is shown that this approximation is very poor for even moderate
values of $p$. Small (1978) showed that plots of $D_i^2$ vs. $\chi^2$ quantiles are misleading.

The  second  procedure  involves  scatter  plots  in  two  dimensions.  If  p  is  not  too 
high,  the  bivariate  plots  of  each  pair  of  variables  are  often  reduced  in  size  and 
shown  on  one  page,  arranged  to  correspond  to  the  entries  in  a  correlation  matrix. 
In  this  visual  matrix,  the  eye  readily  picks  out  those  pairs  of  variables  that  show 
a  curved  trend,  outliers,  or  other  nonnormal  appearance.  This  plot  is  illustrated  in 
Example 4.5.2 in Section 4.5.2. The procedure is based on properties 4 and 6 of
Section 4.2, from which we infer that (1) each pair of variables has a bivariate normal
distribution and (2) bivariate normal variables follow a straight-line trend.

A  popular  option  in  many  graphical  programs  is  the  ability  to  dynamically 
rotate  a  plot  of  three  variables.  While  the  points  are  rotating  on  the  screen,  a  three- 
dimensional  effect  is  created.  The  shape  of  the  three-dimensional  cloud  of  points  is 
readily  perceived,  and  we  can  detect  various  features  of  the  data.  The  only  drawbacks 
to  this  technique  are  that  (1)  it  is  a  dynamic  display  and  cannot  be  printed  and  (2)  if 
p  is  very  large,  the  number  of  subsets  of  three  variables  becomes  unwieldy,  although 
the number of pairs may still be tractable for plotting. These numbers are compared
in Table 4.1, where $\binom{p}{2}$ and $\binom{p}{3}$ represent the number of subsets of sizes 2 and 3,
respectively. Thus in many cases, the scatter plots for pairs of variables will continue
to be used, even though three-dimensional plotting techniques are available.
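The subset counts compared in Table 4.1 are ordinary binomial coefficients and can be reproduced directly. The following is a small sketch; the variable name `rows` is an illustrative choice, not notation from the text.

```python
from math import comb

# Number of pairs and triples of variables for the values of p in Table 4.1.
rows = [(p, comb(p, 2), comb(p, 3)) for p in (6, 8, 10, 12, 15)]
for p, pairs, triples in rows:
    print(f"p = {p:2d}: {pairs:3d} pairs, {triples:3d} triples")
```

The number of triples grows roughly as $p^3/6$, which is why the pairwise scatter plots stay manageable longer than the rotating three-variable displays.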

Table 4.1. Comparison of Number of Subsets of Sizes 2 and 3

     p     $\binom{p}{2}$     $\binom{p}{3}$
     6        15          20
     8        28          56
    10        45         120
    12        66         220
    15       105         455

The third procedure for assessing multivariate normality is a generalization of the
univariate test based on the skewness and kurtosis measures $\sqrt{b_1}$ and $b_2$ as given by
(4.18) and (4.19). The test is due to Mardia (1970). Let $\mathbf{y}$ and $\mathbf{x}$ be independent and
identically distributed with mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$. Then skewness
and kurtosis for multivariate populations are defined by Mardia as

$$\beta_{1,p} = E[(\mathbf{y} - \boldsymbol{\mu})' \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})]^3, \tag{4.32}$$

$$\beta_{2,p} = E[(\mathbf{y} - \boldsymbol{\mu})' \boldsymbol{\Sigma}^{-1} (\mathbf{y} - \boldsymbol{\mu})]^2. \tag{4.33}$$

Since third-order central moments for the multivariate normal distribution are zero,
$\beta_{1,p} = 0$ when $\mathbf{y}$ is $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. It can also be shown that if $\mathbf{y}$ is $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, then

$$\beta_{2,p} = p(p + 2). \tag{4.34}$$

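To estimate $\beta_{1,p}$ and $\beta_{2,p}$ using a sample $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n$, we first define

$$g_{ij} = (\mathbf{y}_i - \bar{\mathbf{y}})' \hat{\boldsymbol{\Sigma}}^{-1} (\mathbf{y}_j - \bar{\mathbf{y}}), \tag{4.35}$$

where $\hat{\boldsymbol{\Sigma}} = \sum_{i=1}^{n} (\mathbf{y}_i - \bar{\mathbf{y}})(\mathbf{y}_i - \bar{\mathbf{y}})'/n$ is the maximum likelihood estimator (4.12).
Then estimates of $\beta_{1,p}$ and $\beta_{2,p}$ are given by

$$b_{1,p} = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} g_{ij}^3, \tag{4.36}$$

$$b_{2,p} = \frac{1}{n} \sum_{i=1}^{n} g_{ii}^2. \tag{4.37}$$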


Table A.5 (Mardia 1970, 1974) gives percentage points of $b_{1,p}$ and $b_{2,p}$ for $p =
2, 3, 4$, which can be used in testing for multivariate normality. For other values of $p$
or when $n > 50$, the following approximate tests are available. For $b_{1,p}$, the statistic

$$z_1 = \frac{(p+1)(n+1)(n+3)}{6[(n+1)(p+1) - 6]}\, b_{1,p} \tag{4.38}$$

is approximately $\chi^2$ with $\frac{1}{6}p(p+1)(p+2)$ degrees of freedom. We reject the hypothesis
of multivariate normality if $z_1 > \chi^2_{.05}$. With $b_{2,p}$, on the other hand, we wish to
reject for large values (distribution too peaked) or small values (distribution too flat).
For the upper 2.5% points of $b_{2,p}$ use

$$z_2 = \frac{b_{2,p} - p(p+2)}{\sqrt{8p(p+2)/n}}, \tag{4.39}$$

which is approximately $N(0, 1)$. For the lower 2.5% points we have two cases:
(1) when $50 < n < 400$, use

$$z_3 = \frac{b_{2,p} - p(p+2)(n+p+1)/n}{\sqrt{8p(p+2)/(n-1)}}, \tag{4.40}$$

which is approximately $N(0, 1)$; (2) when $n > 400$, use $z_2$ as given by (4.39).
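The computation of Mardia's statistics and the large-sample approximations can be sketched as follows. This is an illustrative outline rather than a reference implementation: the helper names (`solve`, `mardia`, `approx_tests`) and the sample `toy` are assumptions introduced here, only the standard library is used, and the critical values ($\chi^2$ and normal percentage points) must still be taken from tables or software.

```python
from math import sqrt


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= factor * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def mardia(data):
    """Return (b_1p, b_2p) of (4.36) and (4.37) for a list of observation rows."""
    n, p = len(data), len(data[0])
    ybar = [sum(row[j] for row in data) / n for j in range(p)]
    dev = [[row[j] - ybar[j] for j in range(p)] for row in data]
    # Maximum likelihood estimator of the covariance matrix (divisor n).
    sigma_hat = [[sum(d[j] * d[k] for d in dev) / n for k in range(p)]
                 for j in range(p)]
    solved = [solve(sigma_hat, d) for d in dev]  # Sigma_hat^{-1} (y_j - ybar)
    g = [[sum(a * b for a, b in zip(d, x)) for x in solved] for d in dev]  # (4.35)
    b1 = sum(g[i][j] ** 3 for i in range(n) for j in range(n)) / n ** 2
    b2 = sum(g[i][i] ** 2 for i in range(n)) / n
    return b1, b2


def approx_tests(b1, b2, n, p):
    """Return z_1 of (4.38), its chi-square degrees of freedom, and z_2 of (4.39)."""
    z1 = (p + 1) * (n + 1) * (n + 3) / (6 * ((n + 1) * (p + 1) - 6)) * b1
    df = p * (p + 1) * (p + 2) / 6
    z2 = (b2 - p * (p + 2)) / sqrt(8 * p * (p + 2) / n)
    return z1, df, z2


# Hypothetical bivariate sample; far too small for the approximations to apply,
# it only demonstrates the mechanics.
toy = [[1.0, 2.0], [2.0, 1.5], [3.0, 3.5], [4.0, 3.0],
       [5.0, 5.5], [6.0, 4.0], [7.0, 7.5], [8.0, 6.0]]
b1, b2 = mardia(toy)
print(b1, b2, approx_tests(b1, b2, len(toy), 2))
```

Because $g_{ij}$ depends on the data only through standardized deviations, the two statistics are unchanged by a nonsingular linear transformation of the variables, which gives a convenient check on any implementation.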


4.5  OUTLIERS 

The detection of outliers has been of concern to statisticians and other scientists for
over a century. Some authors have claimed that the researcher can typically expect
up to 10% of the observations to have errors in measurement or recording. Occasional
stray observations from a different population than the target population are
also fairly common. We review some major concepts and suggested procedures for
univariate outliers in Section 4.5.1 before moving to the multivariate case in
Section 4.5.2. An alternative to detection of outliers is to use robust estimators of $\boldsymbol{\mu}$ and
$\boldsymbol{\Sigma}$ (see Rencher 1998, Section 1.10) that are less sensitive to extreme observations
than are the standard estimators $\bar{\mathbf{y}}$ and $\mathbf{S}$.

4.5.1  Outliers  in  Univariate  Samples 

Excellent  treatments  of  outliers  have  been  given  by  Beckman  and  Cook  (1983), 
Hawkins  (1980),  and  Barnett  and  Lewis  (1978).  We  abstract  a  few  highlights  from 
Beckman  and  Cook.  Many  techniques  have  been  proposed  for  detecting  outliers  in 
the  residuals  from  regression  or  designed  experiments,  but  we  will  be  concerned  only 
with  simple  random  samples  from  the  normal  distribution. 

There are two principal approaches for dealing with outliers. The first is identification,
which usually involves deletion of the outlier(s) but may also provide important
information about the model or the data. The second method is accommodation, in
which the method of analysis or the model is modified. Robust methods, in which the
influence of outliers is reduced, provide an example of modification of the analysis.
An example of a correction to the model is a mixture model that combines two normals
with different variances. For example, Marks and Rao (1978) accommodated a
particular type of outlier by a mixture of two normal distributions.

In  small  or  moderate-sized  univariate  samples,  visual  methods  of  identifying 
outliers  are  the  most  frequently  used.  Tests  are  also  available  if  a  less  subjective 
approach  is  desired. 

Two types of slippage models have been proposed to account for outliers. Under
the mean slippage model, all observations have the same variance, but one or more of
the observations arise from a distribution with a different (population) mean. In the
variance slippage model, one or more of the observations arise from a model with
larger (population) variance but the same mean. Thus in the mean slippage model,
the bulk of the observations arise from $N(\mu, \sigma^2)$, whereas the outliers originate from
$N(\mu + \theta, \sigma^2)$. For the variance slippage model, the main distribution would again
be $N(\mu, \sigma^2)$, with the outliers coming from $N(\mu, a\sigma^2)$ where $a > 1$. These models
have led to the development of tests for rejection of outliers. We now briefly discuss
some of these tests.

For a single outlier in a sample $y_1, y_2, \ldots, y_n$, most tests are based on the maximum
studentized residual,

$$\max_i r_i = \max_i \left| \frac{y_i - \bar{y}}{s} \right|. \tag{4.41}$$


If the largest or smallest observation is rejected, one could then examine the $n - 1$
remaining observations for another possible outlier, and so on. This procedure is
called a consecutive test. However, if there are two or more outliers, the less extreme
ones will often make it difficult to detect the most extreme one, due to inflation of
both mean and variance. This effect is called masking.
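A short sketch may help to make the masking effect concrete. The function name `max_studentized_residual` and both samples are hypothetical; the second sample differs from the first only in that one ordinary value has been replaced by a second extreme value.

```python
from statistics import mean, stdev


def max_studentized_residual(y):
    """Return (index, value) of the largest |y_i - ybar| / s, as in (4.41)."""
    ybar, s = mean(y), stdev(y)  # stdev uses the divisor n - 1
    r = [abs(v - ybar) / s for v in y]
    i = max(range(len(y)), key=r.__getitem__)
    return i, r[i]


# Hypothetical samples: one extreme value, then two extreme values.
one_outlier = [10, 11, 9, 10, 12, 10, 11, 9, 10, 30]
two_outliers = [10, 11, 9, 10, 12, 10, 11, 9, 29, 30]

i1, r1 = max_studentized_residual(one_outlier)
i2, r2 = max_studentized_residual(two_outliers)
print(i1, round(r1, 3))
print(i2, round(r2, 3))
```

In both samples the value 30 gives the largest residual, but the second extreme value inflates both $\bar{y}$ and $s$, so the maximum studentized residual is noticeably smaller in the second sample even though the most extreme observation is the same.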

Ferguson  (1961)  showed  that  the  maximum  studentized  residual  (4.41)  is  more 
powerful  than  most  other  techniques  for  detecting  intermediate  or  large  shifts  in  the 
mean  and  gave  the  following  guidelines  for  small  shifts: 

1. For outliers with small positive shifts in the mean, tests based on sample skewness
are best.

2.  For  outliers  with  small  shifts  in  the  mean  in  either  direction,  tests  based  on  the 
sample  kurtosis  are  best. 

3.  For  outliers  with  small  positive  shifts  in  the  variance,  tests  based  on  the  sample 
kurtosis  are  best. 

Because of the masking problem in consecutive tests, block tests have been proposed
for simultaneous rejection of $k > 1$ outliers. These tests work well if $k$ is
known, but in practice, $k$ is usually not known. If the value we conjecture for $k$ is too
small, we incur the risk of failing to detect any outliers because of masking. If we
set $k$ too large, there is a high risk of rejecting more outliers than there really are, an
effect known as swamping.

4.5.2  Outliers  in  Multivariate  Samples 

In  the  case  of  multivariate  data,  the  problems  in  detecting  outliers  are  intensified  for 
several  reasons: 

1.  For  p  >  2  the  data  cannot  be  readily  plotted  to  pinpoint  the  outliers. 

2.  Multivariate  data  cannot  be  ordered  as  can  a  univariate  sample,  where  extremes 
show  up  readily  on  either  end. 

3. An observation vector may have a large recording error in one of its components
or smaller errors in several components.

4. A multivariate outlier may reflect slippage in mean, variance, or correlation.
This is illustrated in Figure 4.8. Observation 1 causes a small shift in means
and variances of both $y_1$ and $y_2$ but has little effect on the correlation.
Observation 2 has little effect on means and variances, but it reduces the correlation
somewhat. Observation 3 has a major effect on means, variances, and
correlation.

One approach to multivariate outlier identification or accommodation is to use
robust methods of estimation. Such methods minimize the influence of outliers in
estimation or model fitting. However, an outlier sometimes furnishes valuable
information, and the specific pursuit of outliers can be very worthwhile.

Figure 4.8. Bivariate sample showing three types of outliers.

We present two methods of multivariate outlier identification, both of which are
related to methods of assessing multivariate normality. (A third approach based
on principal components is given in Section 12.4.) The first method, due to Wilks
(1963), is designed for detection of a single outlier. Wilks' statistic is


$$w = \max_i \frac{|(n-2)\mathbf{S}_{-i}|}{|(n-1)\mathbf{S}|}, \tag{4.42}$$


where $\mathbf{S}$ is the usual sample covariance matrix and $\mathbf{S}_{-i}$ is obtained from the same
sample with the $i$th observation deleted. The statistic $w$ can also be expressed in
terms of $D_{(n)}^2 = \max_i (\mathbf{y}_i - \bar{\mathbf{y}})' \mathbf{S}^{-1} (\mathbf{y}_i - \bar{\mathbf{y}})$ as

$$w = 1 - \frac{n D_{(n)}^2}{(n-1)^2}, \tag{4.43}$$

thus basing a test for an outlier on the distances $D_i^2$ used in Section 4.4.2 in a graphical
procedure for checking multivariate normality. Table A.6 gives the upper 5% and
1% critical values for $D_{(n)}^2$ from Barnett and Lewis (1978).

Yang and Lee (1987) provide an F-test of $w$ as given by (4.43). Define

$$F_i = \frac{n-p-1}{p} \left[ \frac{1}{1 - n D_i^2/(n-1)^2} - 1 \right], \quad i = 1, 2, \ldots, n. \tag{4.44}$$

Then the $F_i$'s are independently and identically distributed as $F_{p,\,n-p-1}$, and a test
can be constructed in terms of $\max_i F_i$:

$$P(\max_i F_i > f) = 1 - P(\text{all } F_i \le f) = 1 - [P(F \le f)]^n.$$

Therefore, the test can be carried out using an F-table. Note that

$$\max_i F_i = F_{(n)} = \frac{n-p-1}{p} \left( \frac{1}{w} - 1 \right), \tag{4.45}$$

where $w$ is given in (4.43).
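The chain from $D_{(n)}^2$ to $w$ to $F_{(n)}$ is simple arithmetic and can be sketched as follows. The function names are hypothetical. The inputs are the largest distance reported for the ramus bone data in Example 4.5.2 ($D_{(n)}^2 = 11.0301$ with $n = 20$), and $p = 4$ is taken here as an assumption suggested by the four variables $y_1, \ldots, y_4$ of that example. The last function merely inverts the probability statement above: for an overall level $\alpha$, $F_{(n)}$ is compared with the upper $1 - (1-\alpha)^{1/n}$ point of $F_{p,\,n-p-1}$, which must be looked up separately.

```python
def wilks_w(d2_max, n):
    """Wilks' statistic w expressed through the largest distance, as in (4.43)."""
    return 1.0 - n * d2_max / (n - 1) ** 2


def f_max(d2_max, n, p):
    """F_(n) of (4.45), computed from w."""
    w = wilks_w(d2_max, n)
    return (n - p - 1) / p * (1.0 / w - 1.0)


def per_observation_tail(alpha, n):
    """Upper-tail probability for a single F_i so that P(max F_i > f) = alpha."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n)


n, p, d2_max = 20, 4, 11.0301  # p = 4 is an assumption, see the lead-in
print(wilks_w(d2_max, n), f_max(d2_max, n, p), per_observation_tail(0.05, n))
```

Working with the tail probability for a single $F_i$ avoids the need for special tables of the maximum; an ordinary F-table or F distribution function suffices.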

The second test we discuss is designed for detection of several outliers. Schwager
and Margolin (1982) showed that the locally best invariant test for mean slippage
is based on Mardia's (1970) sample kurtosis $b_{2,p}$ as defined by (4.35) and (4.37).
To be more specific, among all tests invariant to a class of transformations of the
type $\mathbf{z} = \mathbf{A}\mathbf{y} + \mathbf{b}$, where $\mathbf{A}$ is nonsingular (see Problem 4.8), the test using $b_{2,p}$ is
most powerful for small shifts in the mean vector. This result holds if the proportion
of outliers is no more than 21.13%. With some restrictions on the pattern of the
outliers, the permissible fraction of outliers can go as high as $33\frac{1}{3}\%$. The hypothesis
is $H_0$: no outliers are present. This hypothesis is rejected for large values of $b_{2,p}$.

A table of critical values of $b_{2,p}$ and some approximate tests were described in
Section 4.4.2 following (4.37). Thus the test doubles as a check for multivariate
normality  and  for  the  presence  of  outliers.  One  advantage  of  this  test  for  outliers  is 
that  we  do  not  have  to  specify  the  number  of  outliers  and  run  the  attendant  risk  of 
masking  or  swamping.  Schwager  and  Margolin  (1982)  pointed  out  that  this  feature 
“increases  the  importance  of  performing  an  overall  test  that  is  sensitive  to  a  broad 
range  of  outlier  configurations.  There  is  also  empirical  evidence  that  the  kurtosis  test 
performs  well  in  situations  of  practical  interest  when  compared  with  other  inferential 
outlier  procedures.” 

Sinha (1984) extended the result of Schwager and Margolin to cover the general
case of elliptically symmetric distributions. An elliptically symmetric distribution is
one in which $f(\mathbf{y}) = |\boldsymbol{\Sigma}|^{-1/2} g[(\mathbf{y} - \boldsymbol{\mu})' \boldsymbol{\Sigma}^{-1} (\mathbf{y} - \boldsymbol{\mu})]$. By varying the function $g$,
distributions with shorter or longer tails than the normal can be obtained. The critical
value of $b_{2,p}$ must be adjusted to correspond to the distribution, but rejection for
large values would be a locally best invariant test.


Example 4.5.2. We use the ramus bone data set of Table 3.6 to illustrate a search
for multivariate outliers, while at the same time checking for multivariate normality.
An examination of each column of Table 3.6 does not reveal any apparent univariate
outliers. To check for multivariate outliers, we first calculate $D_i^2$ in (4.27) for each
observation vector. The results are given in Table 4.2. We see that $D_9^2$, $D_{12}^2$, and $D_{20}^2$
seem to stand out as possible outliers. In Table A.6, the upper 5% critical value for the
maximum value, $D_{(20)}^2$, is given as 11.63. In our case, the largest $D_i^2$ is $D_9^2 = 11.03$,
which does not exceed the critical value. This does not surprise us, since the test was
designed to detect a single outlier, and we may have as many as three.

We compute $u_i$ and $v_i$ in (4.28) and (4.29) and plot them in Figure 4.9. The figure
shows a departure from linearity due to three values and possibly a fourth.

Table 4.2. Values of $D_i^2$ for the Ramus Bone Data in Table 3.6

    Observation                Observation
    Number        $D_i^2$          Number        $D_i^2$
     1             .7588       11            2.8301
     2            1.2980       12           10.5718
     3            1.7591       13            2.5941
     4            3.8539       14             .6594
     5             .8706       15             .3246
     6            2.8106       16             .8321
     7            4.2915       17            1.1083
     8            7.9897       18            4.3633
     9           11.0301       19            2.1088
    10            5.3519       20           10.0931

We next calculate $b_{1,p}$ and $b_{2,p}$, as given by (4.36) and (4.37):

$$b_{1,p} = 11.338, \qquad b_{2,p} = 28.884.$$

In Table A.5, the upper .01 critical value for $b_{1,p}$ is 9.9; the upper .005 critical value
for $b_{2,p}$ is 27.1. Thus both $b_{1,p}$ and $b_{2,p}$ exceed their critical values, and we have
significant skewness and kurtosis, apparently caused by the three observations with
large values of $D_i^2$.

Figure 4.9. Q-Q plot of $u_i$ and $v_i$ for the ramus bone data of Table 3.6.

Figure 4.10. Scatter plots for the ramus bone data in Table 3.6 (variables $y_1$, $y_2$, $y_3$, $y_4$).


The bivariate scatter plots are given in Figure 4.10. Three values are clearly separate
from the other observations in the plot of $y_1$ versus $y_4$. In Table 3.6, the 9th,
12th, and 20th values of $y_4$ are not unusual, nor are the 9th, 12th, and 20th values of
$y_1$. However, the increase from $y_1$ to $y_4$ is exceptional in each case. If these values
are not due to errors in recording the data and if this sample is representative, then
we appear to have a mixture of two populations. This should be taken into account
in making inferences. □


PROBLEMS 

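4.1 Consider the two covariance matrices

$$\boldsymbol{\Sigma}_1 = \begin{pmatrix} 14 & 8 & 3 \\ 8 & 5 & 2 \\ 3 & 2 & 1 \end{pmatrix}, \qquad \boldsymbol{\Sigma}_2 = \begin{pmatrix} 6 & 6 & 1 \\ 6 & 8 & 2 \\ 1 & 2 & 1 \end{pmatrix}.$$

Show that $|\boldsymbol{\Sigma}_2| > |\boldsymbol{\Sigma}_1|$ and that $\mathrm{tr}(\boldsymbol{\Sigma}_2) < \mathrm{tr}(\boldsymbol{\Sigma}_1)$. Thus the generalized variance
of population 2 is greater than the generalized variance of population 1, even
though the total variance is less. Comment on why this is true in terms of the
variances and correlations.

4.2 For $\mathbf{z} = (\mathbf{T}')^{-1}(\mathbf{y} - \boldsymbol{\mu})$ in (4.4), show that $E(\mathbf{z}) = \mathbf{0}$ and $\mathrm{cov}(\mathbf{z}) = \mathbf{I}$.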

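4.3 Show that the form of the likelihood function in (4.13) follows from the previous
expression.

4.4 For $(\mathbf{y} - \boldsymbol{\mu})' \boldsymbol{\Sigma}^{-1} (\mathbf{y} - \boldsymbol{\mu})$ in (4.3) and (4.6), show that $E[(\mathbf{y} - \boldsymbol{\mu})' \boldsymbol{\Sigma}^{-1} (\mathbf{y} - \boldsymbol{\mu})] = p$.
Assume $E(\mathbf{y}) = \boldsymbol{\mu}$ and $\mathrm{cov}(\mathbf{y}) = \boldsymbol{\Sigma}$. Normality is not required.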

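4.5 Show that by adding and subtracting $\bar{\mathbf{y}}$, the exponent of (4.13) has the form
given in (4.14), that is,

$$\frac{1}{2} \sum_{i=1}^{n} (\mathbf{y}_i - \bar{\mathbf{y}} + \bar{\mathbf{y}} - \boldsymbol{\mu})' \boldsymbol{\Sigma}^{-1} (\mathbf{y}_i - \bar{\mathbf{y}} + \bar{\mathbf{y}} - \boldsymbol{\mu}) = \frac{1}{2} \sum_{i=1}^{n} (\mathbf{y}_i - \bar{\mathbf{y}})' \boldsymbol{\Sigma}^{-1} (\mathbf{y}_i - \bar{\mathbf{y}}) + \frac{n}{2} (\bar{\mathbf{y}} - \boldsymbol{\mu})' \boldsymbol{\Sigma}^{-1} (\bar{\mathbf{y}} - \boldsymbol{\mu}).$$

4.6 Show that $\sqrt{b_1}$ and $b_2$, as given in (4.18) and (4.19), are invariant to the
transformation $z_i = a y_i + b$.

4.7 Show that if $\mathbf{y}$ is $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, then $\beta_{2,p} = p(p + 2)$ as in (4.34).

4.8 Show that $b_{1,p}$ and $b_{2,p}$, as given by (4.36) and (4.37), are invariant under the
transformation $\mathbf{z}_i = \mathbf{A}\mathbf{y}_i + \mathbf{b}$, where $\mathbf{A}$ is nonsingular. Thus $b_{1,p}$ and $b_{2,p}$ do
not depend on the units of measurement.

4.9 Show that $F_{(n)} = [(n - p - 1)/p](1/w - 1)$ as in (4.45).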

4.10 Suppose $\mathbf{y}$ is $N_3(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, where


(a) Find the distribution of $z = 2y_1 - y_2 + 3y_3$.

(b) Find the joint distribution of $z_1 = y_1 + y_2 + y_3$ and $z_2 = y_1 - y_2 + 2y_3$.

(c) Find the distribution of $y_2$.

(d) Find the joint distribution of $y_1$ and $y_3$.

(e) Find the joint distribution of $y_1$, $y_3$, and $\frac{1}{2}(y_1 + y_2)$.

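4.11 Suppose $\mathbf{y}$ is $N_3(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, with $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ given in the previous problem.

(a) Find a vector $\mathbf{z}$ such that $\mathbf{z} = (\mathbf{T}')^{-1}(\mathbf{y} - \boldsymbol{\mu})$ is $N_3(\mathbf{0}, \mathbf{I})$ as in (4.4).

(b) Find a vector $\mathbf{z}$ such that $\mathbf{z} = (\boldsymbol{\Sigma}^{1/2})^{-1}(\mathbf{y} - \boldsymbol{\mu})$ is $N_3(\mathbf{0}, \mathbf{I})$ as in (4.5).

(c) What is the distribution of $(\mathbf{y} - \boldsymbol{\mu})' \boldsymbol{\Sigma}^{-1} (\mathbf{y} - \boldsymbol{\mu})$?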

4.12 Suppose $\mathbf{y}$ is $N_4(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, where

$$\boldsymbol{\mu} = \begin{pmatrix} -2 \\ 3 \\ -1 \\ 5 \end{pmatrix}, \qquad \boldsymbol{\Sigma} = \begin{pmatrix} 11 & -8 & 3 & 9 \\ -8 & 9 & -3 & 6 \\ 3 & -3 & 2 & 3 \\ 9 & 6 & 3 & 9 \end{pmatrix}.$$

(a) Find the distribution of $z = 4y_1 - 2y_2 + y_3 - 3y_4$.

(b) Find the joint distribution of $z_1 = y_1 + y_2 + y_3 + y_4$ and $z_2 = -2y_1 +
3y_2 + y_3 - 2y_4$.

(c) Find the joint distribution of $z_1 = 3y_1 + y_2 - 4y_3 - y_4$, $z_2 = -y_1 - 3y_2 +
y_3 - 2y_4$, and $z_3 = 2y_1 + 2y_2 + 4y_3 - 5y_4$.

(d) What is the distribution of $y_3$?

(e) What is the joint distribution of $y_2$ and $y_4$?

(f) Find the joint distribution of $y_1$, $\frac{1}{2}(y_1 + y_2)$, $\frac{1}{3}(y_1 + y_2 + y_3)$, and $\frac{1}{4}(y_1 +
y_2 + y_3 + y_4)$.

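4.13 Suppose $\mathbf{y}$ is $N_4(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ with $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ given in the previous problem.

(a) Find a vector $\mathbf{z}$ such that $\mathbf{z} = (\mathbf{T}')^{-1}(\mathbf{y} - \boldsymbol{\mu})$ is $N_4(\mathbf{0}, \mathbf{I})$ as in (4.4).

(b) Find a vector $\mathbf{z}$ such that $\mathbf{z} = (\boldsymbol{\Sigma}^{1/2})^{-1}(\mathbf{y} - \boldsymbol{\mu})$ is $N_4(\mathbf{0}, \mathbf{I})$ as in (4.5).

(c) What is the distribution of $(\mathbf{y} - \boldsymbol{\mu})' \boldsymbol{\Sigma}^{-1} (\mathbf{y} - \boldsymbol{\mu})$?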

4.14 Suppose $\mathbf{y}$ is $N_3(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, with


Which of the following random variables are independent?

(a) $y_1$ and $y_2$

(b) $y_1$ and $y_3$

(c) $y_2$ and $y_3$

(d) $(y_1, y_2)$ and $y_3$

(e) $(y_1, y_3)$ and $y_2$

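4.15 Suppose $\mathbf{y}$ is $N_4(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, with

$$\boldsymbol{\mu} = \begin{pmatrix} -4 \\ 2 \\ 5 \\ 1 \end{pmatrix}, \qquad \boldsymbol{\Sigma} = \begin{pmatrix} 8 & 0 & -1 & 0 \\ 0 & 3 & 0 & 2 \\ -1 & 0 & 5 & 0 \\ 0 & 2 & 0 & 7 \end{pmatrix}.$$

Which of the following random variables are independent?

(a) $y_1$ and $y_2$

(b) $y_1$ and $y_3$

(c) $y_1$ and $y_4$

(d) $y_2$ and $y_3$

(e) $y_2$ and $y_4$

(f) $y_3$ and $y_4$

(g) $(y_1, y_2)$ and $y_3$

(h) $(y_1, y_2)$ and $y_4$

(i) $(y_1, y_3)$ and $y_4$

(j) $y_1$ and $(y_2, y_4)$

(k) $y_1$ and $y_2$ and $y_3$

(l) $y_1$ and $y_2$ and $y_4$

(m) $(y_1, y_2)$ and $(y_3, y_4)$

(n) $(y_1, y_3)$ and $(y_2, y_4)$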



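4.16 Assume $\mathbf{y}$ and $\mathbf{x}$ are subvectors, each $2 \times 1$, where $\binom{\mathbf{y}}{\mathbf{x}}$ is $N_4(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ with


(a) Find $E(\mathbf{y}|\mathbf{x})$ by (4.7).

(b) Find $\mathrm{cov}(\mathbf{y}|\mathbf{x})$ by (4.8).

4.17 Suppose $\mathbf{y}$ and $\mathbf{x}$ are subvectors, such that $\mathbf{y}$ is $2 \times 1$ and $\mathbf{x}$ is $3 \times 1$, with $\boldsymbol{\mu}$ and
$\boldsymbol{\Sigma}$ partitioned accordingly:


Assume that $\binom{\mathbf{y}}{\mathbf{x}}$ is distributed as $N_5(\boldsymbol{\mu}, \boldsymbol{\Sigma})$.

(a) Find $E(\mathbf{y}|\mathbf{x})$ by (4.7).

(b) Find $\mathrm{cov}(\mathbf{y}|\mathbf{x})$ by (4.8).

4.18 Suppose that $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n$ is a random sample from a nonnormal multivariate
population with mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$. If $n$ is large, what is the
approximate distribution of each of the following?

(a) $\sqrt{n}(\bar{\mathbf{y}} - \boldsymbol{\mu})$

(b) $\bar{\mathbf{y}}$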

4.19 For the ramus bone data treated in Example 4.5.2, check each of the four variables
for univariate normality using the following techniques:

(a) Q-Q plots;

(b) $\sqrt{b_1}$ and $b_2$ as given by (4.18) and (4.19);

(c) D'Agostino's test using $D$ and $Y$ given in (4.22) and (4.23);

(d) The test by Lin and Mudholkar using $z$ defined in (4.24).

4.20 For the calcium data in Table 3.3, check for multivariate normality and outliers
using the following tests:

(a) Calculate $D_i^2$ as in (4.27) for each observation.

(b) Compare the largest value of $D_i^2$ with the critical value in Table A.6.

(c) Compute $u_i$ and $v_i$ in (4.28) and (4.29) and plot them. Is there an indication
of nonlinearity or outliers?

(d) Calculate $b_{1,p}$ and $b_{2,p}$ in (4.36) and (4.37) and compare them with critical
values in Table A.5.

4.21 For the probe word data in Table 3.5, check each of the five variables for univariate
normality and outliers using the following tests:

(a) Q-Q plots;

(b) $\sqrt{b_1}$ and $b_2$ as given by (4.18) and (4.19);

(c) D'Agostino's test using $D$ and $Y$ given in (4.22) and (4.23);

(d) The test by Lin and Mudholkar using $z$ defined in (4.24).

4.22 For the probe word data in Table 3.5, check for multivariate normality and
outliers using the following tests:

(a) Calculate $D_i^2$ as in (4.27) for each observation.

(b) Compare the largest value of $D_i^2$ with the critical value in Table A.6.

(c) Compute $u_i$ and $v_i$ in (4.28) and (4.29) and plot them. Is there an indication
of nonlinearity or outliers?

(d) Calculate $b_{1,p}$ and $b_{2,p}$ in (4.36) and (4.37) and compare them with critical
values in Table A.5.

4.23 Six hematology variables were measured on 51 workers (Royston 1983):

$y_1$ = hemoglobin concentration      $y_4$ = lymphocyte count
$y_2$ = packed cell volume            $y_5$ = neutrophil count
$y_3$ = white blood cell count        $y_6$ = serum lead concentration

The data are given in Table 4.3. Check each of the six variables for univariate
normality using the following tests:

(a) Q-Q plots;

(b) $\sqrt{b_1}$ and $b_2$ as given by (4.18) and (4.19);

(c) D'Agostino's test using $D$ and $Y$ given in (4.22) and (4.23);

(d) The test by Lin and Mudholkar using $z$ defined in (4.24).


Table 4.3. Hematology Data

    Observation
    Number        $y_1$     $y_2$     $y_3$     $y_4$     $y_5$     $y_6$
     1           13.4    39     4100    14     25     17
     2           14.6    46     5000    15     30     20
     3           13.5    42     4500    19     21     18
     4           15.0    46     4600    23     16     18
     5           14.6    44     5100    17     31     19
     6           14.0    44     4900    20     24     19
     7           16.4    49     4300    21     17     18
     8           14.8    44     4400    16     26     29
     9           15.2    46     4100    27     13     27
    10           15.5    48     8400    34     42     36
    11           15.2    47             26     27     22
    12           16.9    50     5100    28     17     23
    13           14.8    44     4700    24     20     23
    14           16.2    45     5600    26     25     19
    15           14.7    43     4000    23     13     17
    16           14.7    42     3400     9     22     13
    17           16.5    45     5400    18     32     17
    18           15.4    45     6900    28     36     24
    19           15.1    45     4600    17     29     17
    20           14.2    46     4200    14     25     28
    21           15.9    46     5200     8     34     16
    22           16.0    47     4700    25     14     18
    23           17.4    50     8600    37     39     17
    24           14.3    43     5500    20     31     19
    25           14.8    44     4200    15     24     29
    26           14.9    43     4300     9     32     17
    27           15.5    45     5200    16     30     20
    28           14.5    43     3900    18     18     25
    29           14.4    45     6000    17     37     23
    30           14.6    44     4700    23     21     27
    31           15.3    45     7900    43     23     23
    32           14.9    45     3400    17     15     24
    33           15.8    47     6000    23     32     21
    34           14.4    44     7700    31     39     23
    35           14.7    46     3700    11     23     23
    36           14.8    43     5200    25     19     22
    37           15.4    45     6000    30     25     18
    38           16.2    50     8100    32     38     18
    39           15.0    45     4900    17     26     24
    40           15.1    47     6000    22     33     16
    41           16.0    46     4600    20     22     22
    42           15.3    48     5500    20     23     23
    43           14.5    41     6200    20     36     21
    44           14.2    41     4900    26     20     20
    45           15.0    45     7200    40     25     25
    46           14.2    46     5800    22     31     22
    47           14.9    45     8400    61     17     17
    48           16.2    48     3100    12     15     18
    49           14.5    45     4000    20     18     20
    50           16.4    49     6900    35     22     24
    51           14.7    44     7800    38     34     16
4.24 For the hematology data in Table 4.3, check for multivariate normality using
the following techniques:

(a) Calculate $D_i^2$ as in (4.27) for each observation.

(b) Compare the largest value of $D_i^2$ with the critical value in Table A.6
(extrapolate).

(c) Compute $u_i$ and $v_i$ in (4.28) and (4.29) and plot them. Is there an indication
of nonlinearity or outliers?

(d) Calculate $b_{1,p}$ and $b_{2,p}$ in (4.36) and (4.37) and compare them with critical
values in Table A.5.


CHAPTER  5 


Tests  on  One  or  Two  Mean  Vectors 


5.1  MULTIVARIATE  VERSUS  UNIVARIATE  TESTS 

Hypothesis testing in a multivariate context is more complex than in a univariate
setting. The number of parameters may be staggering. The $p$-variate normal
distribution, for example, has $p$ means, $p$ variances, and $\binom{p}{2}$ covariances, where $\binom{p}{2}$
represents the number of pairs among the $p$ variables. The total number of parameters
is

$$p + p + \binom{p}{2} = \frac{1}{2}p(p + 3).$$

For $p = 10$, for example, the number of parameters is 65, for each of which
a hypothesis could be formulated. Additionally, we might be interested in testing
hypotheses about subsets of these parameters or about functions of them. In some
cases, we have the added dilemma of choosing among competing test statistics (see
Chapter 6).
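The parameter count is easy to tabulate for other values of $p$. The following is a brief sketch; the function name `n_parameters` is an illustrative choice.

```python
from math import comb


def n_parameters(p):
    """Means + variances + covariances of a p-variate normal distribution."""
    return p + p + comb(p, 2)


for p in (2, 5, 10, 20):
    print(p, n_parameters(p))
```

The count grows quadratically in $p$, so doubling the number of variables roughly quadruples the number of parameters about which hypotheses could be formulated.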

We  first  discuss  the  motivation  for  testing  p  variables  multivariately  rather  than 
(or  in  addition  to)  univariately,  as,  for  example,  in  hypotheses  about  p,\,  p,2, . . .  ,  pp 
in  ft.  There  are  at  least  four  arguments  for  a  multivariate  approach  to  hypothesis 
testing: 

1.  The  use  of  p  univariate  tests  inflates  the  Type  I  error  rate,  a,  whereas  the 
multivariate  test  preserves  the  exact  a  level.  For  example,  if  we  do  p  —  10 
separate  univariate  tests  at  the  .05  level,  the  probability  of  at  least  one  false 
rejection  is  greater  than  .05.  If  the  variables  were  independent  (they  rarely 
are),  we  would  have  (under  Ho) 

P( at  least  one  rejection)  =  1  —  /hall  10  tests  accept  Ho) 

=  1  -  (.95) 10  =  .40. 
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The  resulting  overall  a  of  .40  is  not  an  acceptable  error  rate.  Typically,  the  10 
variables  are  correlated,  and  the  overall  a  would  lie  somewhere  between  .05 
and  .40. 

2.  The  univariate  tests  completely  ignore  the  correlations  among  the  variables, 
whereas  the  multivariate  tests  make  direct  use  of  the  correlations. 

3. The multivariate test is more powerful in many cases. The power of a test is the probability of rejecting $H_0$ when it is false. In some cases, all p of the univariate tests fail to reach significance, but the multivariate test is significant because small effects on some of the variables combine to jointly indicate significance. However, for a given sample size, there is a limit to the number of variables a multivariate test can handle without losing power. This is discussed further in Section 5.3.2.

4.  Many  multivariate  tests  involving  means  have  as  a  byproduct  the  construction 
of  a  linear  combination  of  variables  that  reveals  more  about  how  the  variables 
unite  to  reject  the  hypothesis. 
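The arithmetic behind the first argument can be checked with a short Python sketch. It assumes independent tests, as the text does when deriving the upper bound of .40; the function name `familywise_error` is introduced here only for illustration.

```python
def familywise_error(alpha: float, p: int) -> float:
    """P(at least one false rejection) among p independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** p


# Overall error rate for a growing number of independent univariate tests.
for p in (1, 2, 5, 10):
    print(p, round(familywise_error(0.05, p), 4))
```

With correlated variables the overall rate would fall between the single-test level and this independent-case value, which is the range quoted in the text.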


5.2  TESTS ON $\boldsymbol{\mu}$ WITH $\boldsymbol{\Sigma}$ KNOWN

The test on a mean vector assuming a known $\boldsymbol{\Sigma}$ is introduced to illustrate the issues involved in multivariate testing and to serve as a foundation for the unknown $\boldsymbol{\Sigma}$ case. We first review the univariate case, in which we work with a single variable y that is distributed as $N(\mu, \sigma^2)$.


5.2.1  Review of Univariate Test for $H_0\colon \mu = \mu_0$ with $\sigma$ Known

The hypothesis of interest is that the mean of y is equal to a given value, $\mu_0$, versus the alternative that it is not equal to $\mu_0$:

$$H_0\colon \mu = \mu_0 \quad \text{vs.} \quad H_1\colon \mu \ne \mu_0.$$

We do not consider one-sided alternative hypotheses because they do not readily generalize to multivariate tests. We assume a random sample of n observations $y_1, y_2, \ldots, y_n$ from $N(\mu, \sigma^2)$ with $\sigma^2$ known. We calculate $\bar{y} = \sum_{i=1}^{n} y_i/n$ and compare it to $\mu_0$ using the test statistic

$$z = \frac{\bar{y} - \mu_0}{\sigma_{\bar{y}}} = \frac{\bar{y} - \mu_0}{\sigma/\sqrt{n}}, \tag{5.1}$$

which is distributed as $N(0, 1)$ if $H_0$ is true. For $\alpha = .05$, we reject $H_0$ if $|z| > 1.96$. Equivalently, we can use $z^2$, which is distributed as $\chi^2$ with one degree of freedom, and reject $H_0$ if $z^2 > (1.96)^2 = 3.84$. If n is large, we are assured by the central limit theorem that z is approximately normal, even if the observations are not from a normal distribution.

5.2.2  Multivariate Test for $H_0\colon \boldsymbol{\mu} = \boldsymbol{\mu}_0$ with $\boldsymbol{\Sigma}$ Known

In the multivariate case we have several variables measured on each sampling unit, and we wish to hypothesize a value for the mean of each variable, $H_0\colon \boldsymbol{\mu} = \boldsymbol{\mu}_0$ vs. $H_1\colon \boldsymbol{\mu} \ne \boldsymbol{\mu}_0$. More explicitly, we have

$$H_0\colon \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_p \end{pmatrix} = \begin{pmatrix} \mu_{01} \\ \mu_{02} \\ \vdots \\ \mu_{0p} \end{pmatrix}, \qquad H_1\colon \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_p \end{pmatrix} \ne \begin{pmatrix} \mu_{01} \\ \mu_{02} \\ \vdots \\ \mu_{0p} \end{pmatrix},$$

where each $\mu_{0j}$ is specified from previous experience or is a target value. The vector equality in $H_0$ implies $\mu_j = \mu_{0j}$ for all $j = 1, 2, \ldots, p$. The vector inequality in $H_1$ implies at least one $\mu_j \ne \mu_{0j}$. Thus, for example, if $\mu_j = \mu_{0j}$ for all j except 2, for which $\mu_2 \ne \mu_{02}$, then we wish to reject $H_0$.

To test $H_0$, we use a random sample of n observation vectors $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n$ from $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, with $\boldsymbol{\Sigma}$ known, and calculate $\bar{\mathbf{y}} = \sum_{i=1}^{n} \mathbf{y}_i/n$. The test statistic is

$$Z^2 = n(\bar{\mathbf{y}} - \boldsymbol{\mu}_0)'\boldsymbol{\Sigma}^{-1}(\bar{\mathbf{y}} - \boldsymbol{\mu}_0). \tag{5.2}$$

If $H_0$ is true, $Z^2$ is distributed as $\chi^2_p$ by (4.6), and we therefore reject $H_0$ if $Z^2 > \chi^2_{\alpha,p}$. Note that for one variable, $z^2$ [the square of (5.1)] has a chi-square distribution with 1 degree of freedom, whereas, for p variables, $Z^2$ in (5.2) is distributed as a chi-square with p degrees of freedom.

If $\boldsymbol{\Sigma}$ is unknown, we could use $\mathbf{S}$ in its place in (5.2), and $Z^2$ would have an approximate $\chi^2$-distribution. But n would have to be larger than in the analogous univariate situation, in which $t = (\bar{y} - \mu_0)/(s/\sqrt{n})$ is approximately $N(0, 1)$ for $n > 30$. The value of n needed for $n(\bar{\mathbf{y}} - \boldsymbol{\mu}_0)'\mathbf{S}^{-1}(\bar{\mathbf{y}} - \boldsymbol{\mu}_0)$ to have an approximate $\chi^2$-distribution depends on p. This is clarified further in Section 5.3.2.


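Example 5.2.2. In Table 3.1, height and weight were given for a sample of 20 college-age males. Let us assume that this sample originated from the bivariate normal $N_2(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, where

$$\boldsymbol{\Sigma} = \begin{pmatrix} 20 & 100 \\ 100 & 1000 \end{pmatrix}.$$

Suppose we wish to test $H_0\colon \boldsymbol{\mu} = (70, 170)'$. From Example 3.2.1, $\bar{y}_1 = 71.45$ and $\bar{y}_2 = 164.7$. We thus have

$$\begin{aligned}
Z^2 &= n(\bar{\mathbf{y}} - \boldsymbol{\mu}_0)'\boldsymbol{\Sigma}^{-1}(\bar{\mathbf{y}} - \boldsymbol{\mu}_0) \\
&= (20)\begin{pmatrix} 71.45 - 70 \\ 164.7 - 170 \end{pmatrix}' \begin{pmatrix} 20 & 100 \\ 100 & 1000 \end{pmatrix}^{-1} \begin{pmatrix} 71.45 - 70 \\ 164.7 - 170 \end{pmatrix} \\
&= (20)(1.45, -5.3)\begin{pmatrix} .1 & -.01 \\ -.01 & .002 \end{pmatrix}\begin{pmatrix} 1.45 \\ -5.3 \end{pmatrix} = 8.4026.
\end{aligned}$$

Using $\alpha = .05$, $\chi^2_{.05,2} = 5.99$, and we therefore reject $H_0\colon \boldsymbol{\mu} = (70, 170)'$ because $Z^2 = 8.4026 > 5.99$.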

The rejection region for $\bar{\mathbf{y}} = (\bar{y}_1, \bar{y}_2)'$ is on or outside the ellipse in Figure 5.1; that is, the test statistic $Z^2$ is greater than 5.99 if and only if $\bar{\mathbf{y}}$ is outside the ellipse. If $\bar{\mathbf{y}}$ falls inside the ellipse, $H_0$ is accepted. Thus, distance from $\boldsymbol{\mu}_0$ as well as direction must be taken into account. When the distance is standardized by $\boldsymbol{\Sigma}^{-1}$, all points on the curve are “statistically equidistant” from the center.

Note that the test is sensitive to the covariance structure. If $\text{cov}(y_1, y_2)$ were negative, $y_2$ would tend to decrease as $y_1$ increases, and the ellipse would be tilted in the other direction. In this case, $\bar{\mathbf{y}}$ would be in the acceptance region.

Let us now investigate the consequence of testing each variable separately. Using $z_{\alpha/2} = 1.96$ for $\alpha = .05$, we have

$$z_1 = \frac{\bar{y}_1 - \mu_{01}}{\sigma_1/\sqrt{n}} = 1.450 < 1.96,$$

$$z_2 = \frac{\bar{y}_2 - \mu_{02}}{\sigma_2/\sqrt{n}} = -.7495 > -1.96.$$

Thus both tests accept the hypothesis. In this case neither of the $\bar{y}$'s is far enough from the hypothesized value to cause rejection. But when the positive correlation between $y_1$ and $y_2$ is taken into account in the multivariate test, the two evidences against $\boldsymbol{\mu}_0$ combine to cause rejection. This illustrates the third advantage of multivariate tests given in Section 5.1.

Figure 5.1. Elliptical acceptance region.

Figure 5.2. Acceptance and rejection regions for univariate and multivariate tests.

Figure 5.2 shows the rectangular acceptance region for the univariate tests superimposed on the elliptical multivariate acceptance region. The rectangle was obtained by calculating the two acceptance regions

$$\mu_{01} - 1.96\frac{\sigma_1}{\sqrt{n}} < \bar{y}_1 < \mu_{01} + 1.96\frac{\sigma_1}{\sqrt{n}},$$

$$\mu_{02} - 1.96\frac{\sigma_2}{\sqrt{n}} < \bar{y}_2 < \mu_{02} + 1.96\frac{\sigma_2}{\sqrt{n}}.$$

Points inside the ellipse but outside the rectangle will be rejected in at least one univariate dimension but will be accepted multivariately. This illustrates the inflation of $\alpha$ resulting from univariate tests, as discussed in the first motive for multivariate testing in Section 5.1. This phenomenon has been referred to as Rao's paradox. For further discussion see Rao (1966), Healy (1969), and Morrison (1990, p. 174). Points outside the ellipse but inside the rectangle will be rejected multivariately but accepted univariately in both dimensions. This illustrates the third motive for multivariate testing given in Section 5.1, namely, that the multivariate test is more powerful in some situations.

Thus in either case represented by the shaded areas, we should use the multivariate test result, not the univariate results. In the one case, the multivariate test is more powerful than the univariate tests; in the other case, the multivariate test preserves $\alpha$ whereas the univariate tests inflate $\alpha$. Consequently, when the multivariate and univariate results disagree, our tendency is to trust the multivariate result. In Section 5.5, we discuss various procedures for ascertaining the contribution of the individual variables after the multivariate test has rejected the hypothesis. □
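The calculations in Example 5.2.2 can be reproduced with a few lines of Python. This is a minimal sketch using only the numbers quoted in the example; the variable names are chosen here for illustration, and the critical values 5.99 and 1.96 are the ones given in the text rather than computed.

```python
import math

n = 20
ybar = [71.45, 164.7]                      # sample means from Example 3.2.1
mu0 = [70.0, 170.0]                        # hypothesized mean vector
sigma = [[20.0, 100.0], [100.0, 1000.0]]   # known covariance matrix

# Closed-form inverse of the 2 x 2 covariance matrix.
det = sigma[0][0] * sigma[1][1] - sigma[0][1] * sigma[1][0]
sigma_inv = [[sigma[1][1] / det, -sigma[0][1] / det],
             [-sigma[1][0] / det, sigma[0][0] / det]]

# Multivariate statistic (5.2): n * d' Sigma^{-1} d.
d = [ybar[j] - mu0[j] for j in range(2)]
z2 = n * sum(d[i] * sigma_inv[i][j] * d[j] for i in range(2) for j in range(2))

# Separate univariate z statistics, one per variable.
z_uni = [d[j] / math.sqrt(sigma[j][j] / n) for j in range(2)]

print("Z^2 =", round(z2, 4), "reject multivariately:", z2 > 5.99)
print("z_1, z_2 =", [round(z, 4) for z in z_uni],
      "any univariate rejection:", any(abs(z) > 1.96 for z in z_uni))
```

The sketch shows the disagreement discussed above: the joint statistic exceeds its chi-square critical value while neither univariate statistic exceeds 1.96 in absolute value.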


5.3  TESTS ON $\boldsymbol{\mu}$ WHEN $\boldsymbol{\Sigma}$ IS UNKNOWN

In Section 5.2, we said little about properties of the tests, because the tests discussed were of slight practical consequence due to the assumption that $\boldsymbol{\Sigma}$ is known. We will be more concerned with test properties in Sections 5.3 and 5.4, first in the one-sample case and then in the two-sample case. The reader may wonder why we include one-sample tests, since we seldom, if ever, have need of a test for $H_0\colon \boldsymbol{\mu} = \boldsymbol{\mu}_0$. However, we will cover this case for two reasons:

1. Many general principles are more easily illustrated in the one-sample framework than in the two-sample case.

2. Some very useful tests can be cast in the one-sample framework. Two examples are (1) $H_0\colon \boldsymbol{\mu}_d = \mathbf{0}$ used in the paired comparison test covered in Section 5.7 and (2) $H_0\colon \mathbf{C}\boldsymbol{\mu} = \mathbf{0}$ used in profile analysis in Section 5.9, in analysis of repeated measures in Section 6.9, and in growth curves in Section 6.10.

5.3.1  Review of Univariate t-Test for $H_0\colon \mu = \mu_0$ with $\sigma$ Unknown

We first review the familiar one-sample t-test in the univariate case, with only one variable measured on each sampling unit. We assume that a random sample $y_1, y_2, \ldots, y_n$ is available from $N(\mu, \sigma^2)$. We estimate $\mu$ by $\bar{y}$ and $\sigma^2$ by $s^2$, where $\bar{y}$ and $s^2$ are given by (3.1) and (3.4). To test $H_0\colon \mu = \mu_0$ vs. $H_1\colon \mu \ne \mu_0$, we use

$$t = \frac{\bar{y} - \mu_0}{s/\sqrt{n}} = \frac{\sqrt{n}(\bar{y} - \mu_0)}{s}. \tag{5.3}$$

If $H_0$ is true, t is distributed as $t_{n-1}$, where $n - 1$ is the degrees of freedom. We reject $H_0$ if $|\sqrt{n}(\bar{y} - \mu_0)/s| > t_{\alpha/2,n-1}$, where $t_{\alpha/2,n-1}$ is a critical value from the t-table.

The first expression in (5.3), $t = (\bar{y} - \mu_0)/(s/\sqrt{n})$, is the characteristic form of the t-statistic, which represents a sample standardized distance between $\bar{y}$ and $\mu_0$. In this form, the hypothesized mean is subtracted from $\bar{y}$ and the difference is divided by $s_{\bar{y}} = s/\sqrt{n}$. Since $y_1, y_2, \ldots, y_n$ is a random sample from $N(\mu, \sigma^2)$, the random variables $\bar{y}$ and s are independent. We will see an analogous characteristic form for the $T^2$-statistic in the multivariate case in Section 5.3.2.

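5.3.2  Hotelling's $T^2$-Test for $H_0\colon \boldsymbol{\mu} = \boldsymbol{\mu}_0$ with $\boldsymbol{\Sigma}$ Unknown

We now move to the multivariate case in which p variables are measured on each sampling unit. We assume that a random sample $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n$ is available from $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, where $\mathbf{y}_i$ contains the p measurements on the ith sampling unit (subject or object). We estimate $\boldsymbol{\mu}$ by $\bar{\mathbf{y}}$ and $\boldsymbol{\Sigma}$ by $\mathbf{S}$, where $\bar{\mathbf{y}}$ and $\mathbf{S}$ are given by (3.16), (3.19), (3.22), (3.27), and (3.29). In order to test $H_0\colon \boldsymbol{\mu} = \boldsymbol{\mu}_0$ versus $H_1\colon \boldsymbol{\mu} \ne \boldsymbol{\mu}_0$, we use an extension of the univariate t-statistic in (5.3). In squared form, the univariate t can be rewritten as

$$t^2 = \frac{n(\bar{y} - \mu_0)^2}{s^2} = n(\bar{y} - \mu_0)(s^2)^{-1}(\bar{y} - \mu_0). \tag{5.4}$$

When $\bar{y} - \mu_0$ and $s^2$ are replaced by $\bar{\mathbf{y}} - \boldsymbol{\mu}_0$ and $\mathbf{S}$, we obtain the test statistic

$$T^2 = n(\bar{\mathbf{y}} - \boldsymbol{\mu}_0)'\mathbf{S}^{-1}(\bar{\mathbf{y}} - \boldsymbol{\mu}_0). \tag{5.5}$$

Alternatively, $T^2$ can be obtained from $Z^2$ in (5.2) by replacing $\boldsymbol{\Sigma}$ with $\mathbf{S}$.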
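As an illustration of (5.5), the following Python sketch computes the one-sample $T^2$ directly from raw observation vectors using only the standard library. The helper names `solve` and `hotelling_t2` and the small data set are hypothetical and chosen here for demonstration; they are not part of the text. Rather than forming $\mathbf{S}^{-1}$ explicitly, the sketch solves the linear system $\mathbf{S}\mathbf{w} = \bar{\mathbf{y}} - \boldsymbol{\mu}_0$.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        if abs(m[piv][c]) < 1e-12:
            raise ValueError("S is singular; T^2 cannot be computed")
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            m[r] = [m[r][k] - f * m[c][k] for k in range(n + 1)]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))) / m[i][i]
    return x


def hotelling_t2(data, mu0):
    """One-sample T^2 = n (ybar - mu0)' S^{-1} (ybar - mu0), as in (5.5)."""
    n, p = len(data), len(mu0)
    ybar = [sum(row[j] for row in data) / n for j in range(p)]
    # Unbiased sample covariance matrix S (divisor n - 1).
    s = [[sum((row[j] - ybar[j]) * (row[k] - ybar[k]) for row in data) / (n - 1)
          for k in range(p)] for j in range(p)]
    d = [ybar[j] - mu0[j] for j in range(p)]
    w = solve(s, d)  # w = S^{-1} (ybar - mu0)
    return n * sum(d[j] * w[j] for j in range(p))


# Hypothetical data: n = 4 observations on p = 2 variables.
sample = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 4.0]]
print(hotelling_t2(sample, [2.0, 3.0]))

# With a single variable, T^2 reduces to the squared t-statistic in (5.4).
print(hotelling_t2([[1.0], [2.0], [3.0], [4.0]], [2.0]))
```

The second call is a simple sanity check that the multivariate statistic agrees with the univariate $t^2$ when p = 1.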

The distribution of $T^2$ was obtained by Hotelling (1931), assuming $H_0$ is true and sampling is from $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. The distribution is indexed by two parameters, the dimension p and the degrees of freedom $\nu = n - 1$. We reject $H_0$ if $T^2 > T^2_{\alpha,p,n-1}$ and accept $H_0$ otherwise. Critical values of the $T^2$-distribution are found in Table A.7, taken from Kramer and Jensen (1969a).

Note that the terminology “accept $H_0$” is used for expositional convenience to describe our decision when we do not reject the hypothesis. Strictly speaking, we do not accept $H_0$ in the sense of actually believing it is true. If the sample size were extremely large and we accepted $H_0$, we could be reasonably certain that the true $\boldsymbol{\mu}$ is close to the hypothesized value $\boldsymbol{\mu}_0$. Otherwise, accepting $H_0$ means only that we have failed to reject $H_0$.

The $T^2$-statistic can be viewed as the sample standardized distance between the observed sample mean vector and the hypothetical mean vector. If the sample mean vector is notably distant from the hypothetical mean vector, we become suspicious of the hypothetical mean vector and wish to reject $H_0$.

The test statistic is a scalar quantity, since $T^2 = n(\bar{\mathbf{y}} - \boldsymbol{\mu}_0)'\mathbf{S}^{-1}(\bar{\mathbf{y}} - \boldsymbol{\mu}_0)$ is a quadratic form. As with the $\chi^2$-distribution of $Z^2$, the density of $T^2$ is skewed because the lower limit is zero and there is no upper limit.

The characteristic form of the $T^2$-statistic (5.5) is

$$T^2 = (\bar{\mathbf{y}} - \boldsymbol{\mu}_0)'\left(\frac{\mathbf{S}}{n}\right)^{-1}(\bar{\mathbf{y}} - \boldsymbol{\mu}_0). \tag{5.6}$$

The characteristic form has two features:

1. $\mathbf{S}/n$ is the sample covariance matrix of $\bar{\mathbf{y}}$ and serves as a standardizing matrix in the distance function.

2. Since $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n$ are distributed as $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, it follows that $\bar{\mathbf{y}}$ is $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}/n)$, $(n - 1)\mathbf{S}$ is $W(n - 1, \boldsymbol{\Sigma})$, and $\bar{\mathbf{y}}$ and $\mathbf{S}$ are independent (see Section 4.3.2).

In (5.3), the univariate t-statistic represents the number of standard deviations $\bar{y}$ is separated from $\mu_0$. In appearance, the $T^2$-statistic (5.6) is similar, but no such simple interpretation is possible. If we add a variable, the distance in (5.6) increases. (By analogy, the hypotenuse of a right triangle is longer than either of the legs.) Thus we need a test statistic that indicates the significance of the distance from $\bar{\mathbf{y}}$ to $\boldsymbol{\mu}_0$, while allowing for the number of dimensions (see comment 3 at the end of this section about the $T^2$-table). Since the resulting $T^2$-statistic cannot be readily interpreted in terms of the number of standard deviations $\bar{\mathbf{y}}$ is from $\boldsymbol{\mu}_0$, we do not have an intuitive feel for its significance as we do with the univariate t. We must compare the calculated value of $T^2$ with the table value. In addition, the $T^2$-table provides some insights into the behavior of the $T^2$-distribution. Four of these insights are noted at the end of this section.

If a test leads to rejection of $H_0\colon \boldsymbol{\mu} = \boldsymbol{\mu}_0$, the question arises as to which variable or variables contributed most to the rejection. This issue is discussed in Section 5.5 for the two-sample $T^2$-test of $H_0\colon \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2$, and the results there can be easily adapted to the one-sample test of $H_0\colon \boldsymbol{\mu} = \boldsymbol{\mu}_0$. For confidence intervals on the individual $\mu_j$'s in $\boldsymbol{\mu}$, see Rencher (1998, Section 3.4).

The following are some key properties of the $T^2$-test:

1. We must have $n - 1 > p$. Otherwise, $\mathbf{S}$ is singular and $T^2$ cannot be computed.

2. In both the one-sample and two-sample cases, the degrees of freedom for the $T^2$-statistic will be the same as for the analogous univariate t-test; that is, $\nu = n - 1$ for one sample and $\nu = n_1 + n_2 - 2$ for two samples (see Section 5.4.2).

3. The alternative hypothesis is two-sided. Because the space is multidimensional, we do not consider one-sided alternative hypotheses, such as $\boldsymbol{\mu} > \boldsymbol{\mu}_0$. However, even though the alternative hypothesis $H_1\colon \boldsymbol{\mu} \ne \boldsymbol{\mu}_0$ is essentially two-sided, the critical region is one-tailed (we reject $H_0$ for large values of $T^2$). This is typical of many multivariate tests.

4. In the univariate case, $t^2_{n-1} = F_{1,n-1}$. The statistic $T^2$ can also be converted to an F-statistic as follows:

$$\frac{\nu - p + 1}{\nu p}\,T^2_{p,\nu} = F_{p,\nu-p+1}. \tag{5.7}$$

Note that the dimension p (number of variables) of the $T^2$-statistic becomes the first of the two degrees-of-freedom parameters of the F. The number of degrees of freedom for $T^2$ is denoted by $\nu$, and the F transformation is given in terms of a general $\nu$, since other applications of $T^2$ will have $\nu$ different from $n - 1$ (see, for example, Sections 5.4.2 and 6.3.2).


Equation (5.7) gives an easy way to find critical values for the $T^2$-test. However, we have provided critical values of $T^2$ in Table A.7 because of the insights they provide into the behavior of the $T^2$-distribution in particular and multivariate tests in general. The following are some insights that can readily be gleaned from the $T^2$-tables:

1. The first column of Table A.7 contains squares of t-table values; that is, $T^2_{\alpha,1,\nu} = t^2_{\alpha/2,\nu}$. (We use $t^2_{\alpha/2}$ because the univariate test of $H_0\colon \mu = \mu_0$ vs. $H_1\colon \mu \ne \mu_0$ is two-tailed.) Thus for p = 1, $T^2$ reduces to $t^2$. This can easily be seen by comparing (5.5) with (5.4).

2. The last row of each page of Table A.7 contains $\chi^2$ critical values, that is, $T^2_{p,\infty} = \chi^2_p$. Thus as n increases, $\mathbf{S}$ approaches $\boldsymbol{\Sigma}$, and

$$T^2 = n(\bar{\mathbf{y}} - \boldsymbol{\mu}_0)'\mathbf{S}^{-1}(\bar{\mathbf{y}} - \boldsymbol{\mu}_0)$$

approaches $Z^2 = n(\bar{\mathbf{y}} - \boldsymbol{\mu}_0)'\boldsymbol{\Sigma}^{-1}(\bar{\mathbf{y}} - \boldsymbol{\mu}_0)$ in (5.2), which is distributed as $\chi^2_p$.

3. The values increase along each row of Table A.7; that is, for a fixed $\nu$, the critical value $T^2_{\alpha,p,\nu}$ increases with p. It was noted above that in any given sample, the calculated value of $T^2$ increases if a variable is added. However, since the critical value also increases, a variable should not be added unless it adds a significant amount to $T^2$.

4. As p increases, larger values of $\nu$ are required for the distribution of $T^2$ to approach $\chi^2$. In the univariate case, t in (5.3) is considered a good approximation to the standard normal z in (5.1) when $\nu = n - 1$ is at least 30. In the first column (p = 1) of Table A.7, we see $T^2_{.05,1,30} = 4.171$ and $T^2_{.05,1,\infty} = 3.841$, with a ratio of 4.171/3.841 = 1.086. For p = 5, $\nu$ must be 100 to obtain the same ratio: $T^2_{.05,5,100}/T^2_{.05,5,\infty} = 1.086$. For p = 10, we need $\nu = 200$ to obtain a similar value of the ratio: $T^2_{.05,10,200}/T^2_{.05,10,\infty} = 1.076$. Thus one must be very cautious in stating that $T^2$ has an approximate $\chi^2$-distribution for large n. The $\alpha$ level (Type I error rate) could be substantially inflated. For example, suppose p = 10 and we assume that n = 30 is sufficiently large for a $\chi^2$-approximation to hold. Then we would reject $H_0$ for $T^2 > 18.307$ with a target $\alpha$-level of .05. However, the correct critical value is 34.044, and the misuse of 18.307 would yield an actual $\alpha$ of $P(T^2_{10,29} > 18.307) = .314$.


Example 5.3.2. In Table 3.3 we have n = 10 observations on p = 3 variables. Desirable levels for $y_1$ and $y_2$ are 15.0 and 6.0, respectively, and the expected level of $y_3$ is 2.85. We can, therefore, test the hypothesis

$$H_0\colon \boldsymbol{\mu} = \begin{pmatrix} 15.0 \\ 6.0 \\ 2.85 \end{pmatrix}.$$

In Examples 3.5 and 3.6, $\bar{\mathbf{y}}$ and $\mathbf{S}$ were obtained as

$$\bar{\mathbf{y}} = \begin{pmatrix} 28.1 \\ 7.18 \\ 3.09 \end{pmatrix}, \qquad \mathbf{S} = \begin{pmatrix} 140.54 & 49.68 & 1.94 \\ 49.68 & 72.25 & 3.68 \\ 1.94 & 3.68 & .25 \end{pmatrix}.$$

To test $H_0$, we use (5.5):

$$\begin{aligned}
T^2 &= n(\bar{\mathbf{y}} - \boldsymbol{\mu}_0)'\mathbf{S}^{-1}(\bar{\mathbf{y}} - \boldsymbol{\mu}_0) \\
&= 10\begin{pmatrix} 28.1 - 15.0 \\ 7.18 - 6.0 \\ 3.09 - 2.85 \end{pmatrix}' \begin{pmatrix} 140.54 & 49.68 & 1.94 \\ 49.68 & 72.25 & 3.68 \\ 1.94 & 3.68 & .25 \end{pmatrix}^{-1} \begin{pmatrix} 28.1 - 15.0 \\ 7.18 - 6.0 \\ 3.09 - 2.85 \end{pmatrix} \\
&= 24.559.
\end{aligned}$$

From Table A.7, we obtain the critical value $T^2_{.05,3,9} = 16.766$. Since the observed value of $T^2$ exceeds the critical value, we reject the hypothesis. □
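The F conversion in (5.7) can be applied to the numbers in Example 5.3.2. The sketch below is illustrative only: the function name `t2_to_f` is introduced here, and the inputs are the observed and tabled $T^2$ values quoted in the example. The converted critical value should agree with a tabled $F_{.05,3,7}$, which is an assumption to be checked against an F-table since none is reproduced in this section.

```python
def t2_to_f(t2: float, p: int, nu: int) -> float:
    """Convert a T^2 value (dimension p, nu d.f.) to F with (p, nu - p + 1) d.f."""
    return (nu - p + 1) / (nu * p) * t2


p, nu = 3, 9                          # p variables, nu = n - 1 with n = 10
f_obs = t2_to_f(24.559, p, nu)        # observed T^2 from Example 5.3.2
f_crit = t2_to_f(16.766, p, nu)       # tabled critical value T^2_{.05,3,9}
df = (p, nu - p + 1)                  # degrees of freedom of the F-statistic

print("F d.f.:", df)
print("observed F:", round(f_obs, 3), "critical F:", round(f_crit, 3))
print("reject H0:", f_obs > f_crit)
```

Because the conversion multiplies both the observed and the critical value by the same positive constant, the decision is identical on either scale.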


5.4  COMPARING TWO MEAN VECTORS

We first review the univariate two-sample t-test and then proceed with the analogous multivariate test.


5.4.1  Review of Univariate Two-Sample t-Test

In the one-variable case we obtain a random sample $y_{11}, y_{12}, \ldots, y_{1n_1}$ from $N(\mu_1, \sigma_1^2)$ and a second random sample $y_{21}, y_{22}, \ldots, y_{2n_2}$ from $N(\mu_2, \sigma_2^2)$. We assume that the two samples are independent and that $\sigma_1^2 = \sigma_2^2 = \sigma^2$, say, with $\sigma^2$ unknown. [The assumptions of independence and equal variances are necessary in order for the t-statistic in (5.8) to have a t-distribution.] From the two samples we calculate $\bar{y}_1$, $\bar{y}_2$, $SS_1 = \sum_{i=1}^{n_1}(y_{1i} - \bar{y}_1)^2 = (n_1 - 1)s_1^2$, $SS_2 = \sum_{i=1}^{n_2}(y_{2i} - \bar{y}_2)^2 = (n_2 - 1)s_2^2$, and the pooled variance

$$s_{pl}^2 = \frac{SS_1 + SS_2}{n_1 + n_2 - 2} = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2},$$

where $n_1 + n_2 - 2$ is the sum of the weights $n_1 - 1$ and $n_2 - 1$ in the numerator. With this denominator, $s_{pl}^2$ is an unbiased estimator for the common variance, $\sigma^2$, that is, $E(s_{pl}^2) = \sigma^2$.

To test

$$H_0\colon \mu_1 = \mu_2 \quad \text{vs.} \quad H_1\colon \mu_1 \ne \mu_2,$$

we use

$$t = \frac{\bar{y}_1 - \bar{y}_2}{s_{pl}\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}, \tag{5.8}$$

which has a t-distribution with $n_1 + n_2 - 2$ degrees of freedom when $H_0$ is true. We therefore reject $H_0$ if $|t| > t_{\alpha/2,n_1+n_2-2}$.

Note that (5.8) exhibits the characteristic form of a t-statistic. In this form, the denominator is the sample standard deviation of the numerator; that is,

$$s_{pl}\sqrt{1/n_1 + 1/n_2}$$

is an estimate of

$$\sigma_{\bar{y}_1 - \bar{y}_2} = \sqrt{\text{var}(\bar{y}_1 - \bar{y}_2)} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} = \sqrt{\sigma^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}.$$

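5.4.2  Multivariate Two-Sample $T^2$-Test

We now consider the case where p variables are measured on each sampling unit in two samples. We wish to test

$$H_0\colon \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2 \quad \text{vs.} \quad H_1\colon \boldsymbol{\mu}_1 \ne \boldsymbol{\mu}_2.$$

We obtain a random sample $\mathbf{y}_{11}, \mathbf{y}_{12}, \ldots, \mathbf{y}_{1n_1}$ from $N_p(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1)$ and a second random sample $\mathbf{y}_{21}, \mathbf{y}_{22}, \ldots, \mathbf{y}_{2n_2}$ from $N_p(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)$. We assume that the two samples are independent and that $\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 = \boldsymbol{\Sigma}$, say, with $\boldsymbol{\Sigma}$ unknown. These assumptions are necessary in order for the $T^2$-statistic in (5.9) to have a $T^2$-distribution. A test of $H_0\colon \boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2$ is given in Section 7.3.2. For an approximate test of $H_0\colon \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2$ that can be used when $\boldsymbol{\Sigma}_1 \ne \boldsymbol{\Sigma}_2$, see Rencher (1998, Section 3.9).

The sample mean vectors are $\bar{\mathbf{y}}_1 = \sum_{i=1}^{n_1}\mathbf{y}_{1i}/n_1$ and $\bar{\mathbf{y}}_2 = \sum_{i=1}^{n_2}\mathbf{y}_{2i}/n_2$. Define $\mathbf{W}_1$ and $\mathbf{W}_2$ to be the matrices of sums of squares and cross products for the two samples:

$$\mathbf{W}_1 = \sum_{i=1}^{n_1}(\mathbf{y}_{1i} - \bar{\mathbf{y}}_1)(\mathbf{y}_{1i} - \bar{\mathbf{y}}_1)' = (n_1 - 1)\mathbf{S}_1,$$

$$\mathbf{W}_2 = \sum_{i=1}^{n_2}(\mathbf{y}_{2i} - \bar{\mathbf{y}}_2)(\mathbf{y}_{2i} - \bar{\mathbf{y}}_2)' = (n_2 - 1)\mathbf{S}_2.$$

Since $(n_1 - 1)\mathbf{S}_1$ is an unbiased estimator of $(n_1 - 1)\boldsymbol{\Sigma}$ and $(n_2 - 1)\mathbf{S}_2$ is an unbiased estimator of $(n_2 - 1)\boldsymbol{\Sigma}$, we can pool them to obtain an unbiased estimator of the common population covariance matrix, $\boldsymbol{\Sigma}$:

$$\mathbf{S}_{pl} = \frac{1}{n_1 + n_2 - 2}(\mathbf{W}_1 + \mathbf{W}_2) = \frac{1}{n_1 + n_2 - 2}\left[(n_1 - 1)\mathbf{S}_1 + (n_2 - 1)\mathbf{S}_2\right].$$

Thus $E(\mathbf{S}_{pl}) = \boldsymbol{\Sigma}$.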

The square of the univariate t-statistic (5.8) can be expressed as

$$t^2 = \frac{n_1 n_2}{n_1 + n_2}(\bar{y}_1 - \bar{y}_2)(s_{pl}^2)^{-1}(\bar{y}_1 - \bar{y}_2).$$

This can be generalized to p variables by substituting $\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2$ for $\bar{y}_1 - \bar{y}_2$ and $\mathbf{S}_{pl}$ for $s_{pl}^2$ to obtain

$$T^2 = \frac{n_1 n_2}{n_1 + n_2}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)'\mathbf{S}_{pl}^{-1}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2), \tag{5.9}$$

which is distributed as $T^2_{p,n_1+n_2-2}$ when $H_0\colon \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2$ is true. To carry out the test, we collect the two samples, calculate $T^2$ by (5.9), and reject $H_0$ if $T^2 > T^2_{\alpha,p,n_1+n_2-2}$. Critical values of $T^2$ are found in Table A.7. For tables of the power of the $T^2$-test (probability of rejecting $H_0$ when it is false) and illustrations of their use, see Rencher (1998, Section 3.10).

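The $T^2$-statistic (5.9) can be expressed in characteristic form as the standardized distance between $\bar{\mathbf{y}}_1$ and $\bar{\mathbf{y}}_2$:

$$T^2 = (\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)'\left[\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\mathbf{S}_{pl}\right]^{-1}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2), \tag{5.10}$$

where $(1/n_1 + 1/n_2)\mathbf{S}_{pl}$ is the sample covariance matrix for $\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2$ and $\mathbf{S}_{pl}$ is independent of $\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2$ because of sampling from the multivariate normal. For a discussion of robustness of $T^2$ to departures from the assumptions of multivariate normality and homogeneity of covariance matrices ($\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2$), see Rencher (1998, Section 3.7).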

Some key properties of the two-sample $T^2$-test are given in the following list:

1. It is necessary that $n_1 + n_2 - 2 > p$ for $\mathbf{S}_{pl}$ to be nonsingular.

2. The statistic $T^2$ is, of course, a scalar. The $3p + p(p - 1)/2$ quantities in $\bar{\mathbf{y}}_1$, $\bar{\mathbf{y}}_2$, and $\mathbf{S}_{pl}$ have been reduced to a single scale on which $T^2$ is large if the sample evidence favors $H_1\colon \boldsymbol{\mu}_1 \ne \boldsymbol{\mu}_2$ and small if the evidence supports $H_0\colon \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2$; we reject $H_0$ if the standardized distance between $\bar{\mathbf{y}}_1$ and $\bar{\mathbf{y}}_2$ is large.

3. Since the lower limit of $T^2$ is zero and there is no upper limit, the density is skewed. In fact, as noted in (5.11), $T^2$ is directly related to F, which is a well-known skewed distribution.

4. For degrees of freedom of $T^2$ we have $n_1 + n_2 - 2$, which is the same as for the corresponding univariate t-statistic (5.8).

5. The alternative hypothesis $H_1\colon \boldsymbol{\mu}_1 \ne \boldsymbol{\mu}_2$ is two-sided. The critical region $T^2 > T^2_{\alpha}$ is one-tailed, however, as is typical of many multivariate tests.

6. The $T^2$-statistic can be readily transformed to an F-statistic using (5.7):

$$\frac{n_1 + n_2 - p - 1}{(n_1 + n_2 - 2)p}\,T^2 = F_{p,n_1+n_2-p-1}, \tag{5.11}$$

where again the dimension p of the $T^2$-statistic becomes the first degree-of-freedom parameter for the F-statistic.


Example 5.4.2. Four psychological tests were given to 32 men and 32 women. The data are recorded in Table 5.1 (Beall 1945). The variables are

$y_1$ = pictorial inconsistencies      $y_3$ = tool recognition
$y_2$ = paper form board               $y_4$ = vocabulary

The mean vectors and covariance matrices of the two samples are

$$\bar{\mathbf{y}}_1 = \begin{pmatrix} 15.97 \\ 15.91 \\ 27.19 \\ 22.75 \end{pmatrix}, \qquad \bar{\mathbf{y}}_2 = \begin{pmatrix} 12.34 \\ 13.91 \\ 16.66 \\ 21.94 \end{pmatrix},$$

$$\mathbf{S}_1 = \begin{pmatrix} 5.192 & 4.545 & 6.522 & 5.250 \\ 4.545 & 13.18 & 6.760 & 6.266 \\ 6.522 & 6.760 & 28.67 & 14.47 \\ 5.250 & 6.266 & 14.47 & 16.65 \end{pmatrix},$$

$$\mathbf{S}_2 = \begin{pmatrix} 9.136 & 7.549 & 4.864 & 4.151 \\ 7.549 & 18.60 & 10.22 & 5.446 \\ 4.864 & 10.22 & 30.04 & 13.49 \\ 4.151 & 5.446 & 13.49 & 28.00 \end{pmatrix}.$$

The sample covariance matrices do not appear to indicate a disparity in the population covariance matrices. (A significance test to check this assumption is carried out in Example 7.3.2, and the hypothesis $H_0\colon \boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2$ is not rejected.) The pooled covariance matrix is

$$\mathbf{S}_{pl} = \frac{1}{32 + 32 - 2}\left[(32 - 1)\mathbf{S}_1 + (32 - 1)\mathbf{S}_2\right] = \begin{pmatrix} 7.164 & 6.047 & 5.693 & 4.701 \\ 6.047 & 15.89 & 8.492 & 5.856 \\ 5.693 & 8.492 & 29.36 & 13.98 \\ 4.701 & 5.856 & 13.98 & 22.32 \end{pmatrix}.$$
Table 5.1. Four Psychological Test Scores on 32 Males and 32 Females

         Males                    Females
  y1   y2   y3   y4        y1   y2   y3   y4

  15   17   24   14        13   14   12   21
  17   15   32   26        14   12   14   26
  15   14   29   23        12   19   21   21
  13   12   10   16        12   13   10   16
  20   17   26   28        11   20   16   16
  15   21   26   21        12    9   14   18
  15   13   26   22        10   13   18   24
  13    5   22   22        10    8   13   23
  14    7   30   17        12   20   19   23
  17   15   30   27        11   10   11   27
  17   17   26   20        12   18   25   25
  17   20   28   24        14   18   13   26
  15   15   29   24        14   10   25   28
  18   19   32   28        13   16    8   14
  18   18   31   27        14    8   13   25
  15   14   26   21        13   16   23   28
  18   17   33   26        16   21   26   26
  10   14   19   17        14   17   14   14
  18   21   30   29        16   16   15   23
  18   21   34   26        13   16   23   24
  13   17   30   24         2    6   16   21
  16   16   16   16        14   16   22   26
  11   15   25   23        17   17   22   28
  16   13   26   16        16   13   16   14
  16   13   23   21        15   14   20   26
  18   18   34   24        12   10   12    9
  16   15   28   27        14   17   24   23
  15   16   29   24        13   15   18   20
  18   19   32   23        11   16   18   28
  18   16   33   23         7    7   19   18
  17   20   21   21        12   15    7   28
  19   19   30   28         6    5    6   13

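By (5.9), we obtain

$$T^2 = \frac{n_1 n_2}{n_1 + n_2}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)'\mathbf{S}_{pl}^{-1}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2) = 97.6015.$$

From interpolation in Table A.7, we obtain $T^2_{.01,4,62} = 15.373$, and we therefore reject $H_0\colon \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2$. See Example 5.5 for a discussion of which variables contribute most to separation of the two groups. □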
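The two-sample statistic in Example 5.4.2 can be approximated from the rounded summary statistics printed in the example. The Python sketch below is illustrative: the helper `solve` and the variable names are introduced here, and because the means and pooled covariance matrix are rounded, the result is expected to be close to, but not exactly, the reported 97.6015. It also applies the F conversion in (5.11).

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            m[r] = [m[r][k] - f * m[c][k] for k in range(n + 1)]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))) / m[i][i]
    return x


n1, n2, p = 32, 32, 4
ybar1 = [15.97, 15.91, 27.19, 22.75]   # male mean vector (rounded)
ybar2 = [12.34, 13.91, 16.66, 21.94]   # female mean vector (rounded)
s_pl = [[7.164, 6.047, 5.693, 4.701],  # pooled covariance matrix (rounded)
        [6.047, 15.89, 8.492, 5.856],
        [5.693, 8.492, 29.36, 13.98],
        [4.701, 5.856, 13.98, 22.32]]

d = [u - v for u, v in zip(ybar1, ybar2)]
w = solve(s_pl, d)                      # w = S_pl^{-1} (ybar1 - ybar2)
t2 = n1 * n2 / (n1 + n2) * sum(di * wi for di, wi in zip(d, w))   # (5.9)

nu = n1 + n2 - 2
f_stat = (nu - p + 1) / (nu * p) * t2   # (5.11), F with (p, nu - p + 1) d.f.

print("T^2 (from rounded summaries):", round(t2, 2))
print("F:", round(f_stat, 2), "with d.f.", (p, nu - p + 1))
print("reject H0 at the tabled critical value 15.373:", t2 > 15.373)
```

Working from the raw scores in Table 5.1 instead of the rounded summaries would remove the small rounding discrepancy.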
5.4.3  Likelihood Ratio Tests

The maximum likelihood approach to estimation was introduced in Section 4.3.1. As noted there, the likelihood function is the joint density of $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n$. The values of the parameters that maximize the likelihood function are the maximum likelihood estimators.

The likelihood ratio method of test construction uses the ratio of the maximum value of the likelihood function assuming $H_0$ is true to the maximum under $H_1$, which is essentially unrestricted. Likelihood ratio tests usually have good power and sometimes have optimum power over a wide class of alternatives.

When applied to multivariate normal samples and $H_0\colon \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2$, the likelihood ratio approach leads directly to Hotelling's $T^2$-test in (5.9). Similarly, in the one-sample case, the $T^2$-statistic in (5.5) is the likelihood ratio test. Thus the $T^2$-test, which we introduced rather informally, is the best test according to certain criteria.


5.5  TESTS  ON  INDIVIDUAL  VARIABLES  CONDITIONAL  ON 
REJECTION  OF  770  BY  THE  T2-TEST 


If  the  hypothesis  Hq  :  /x  \  =  fx 2  is  rejected,  the  implication  is  that  ji  \  /  ^  n2j  for 
at  least  one  j  =  1,2,...  ,  p.  But  there  is  no  guarantee  that  Hq  :  /i 1  j  —  n 2j  will  be 
rejected  for  some  j  by  a  univariate  test.  However,  if  we  consider  a  linear  combination 
of  the  variables,  z  =  a'y,  then  there  is  at  least  one  coefficient  vector  a  for  which 

Ka)  =  ,  (5-12) 

y  (i/«i  +  i/«2  )•*? 


will  reject  the  corresponding  hypothesis  Hq:  /x - ,  =  /xZ2  or  Ho:  a'/x ]  =  a' /x2 •  By 
(3.54),  zi  —  a'yj  and  z2  =  a'y2,  and  from  (3.55)  the  variance  estimator  si  is  the 
pooled  estimator  a'Spia.  Thus  (5.12)  can  be  written  as 


1(a)  = 


a'yi  -  a'y2 
+n2)/«i/(2]a,Spia 


(5.13) 


Since $t(\mathbf{a})$ can be negative, we work with $t^2(\mathbf{a})$. The linear function $z = \mathbf{a}'\mathbf{y}$
is a projection of $\mathbf{y}$ onto a line through the origin. We seek the line (direction) on
which the difference $\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2$ is maximized when projected. The projected difference
$\mathbf{a}'(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)$ [standardized by $\mathbf{a}'\mathbf{S}_{\mathrm{pl}}\mathbf{a}$ as in (5.13)] will be less in any other direction
than that parallel to the line joining $\bar{\mathbf{y}}_1$ and $\bar{\mathbf{y}}_2$. The value of $\mathbf{a}$ that projects onto this
line, or, equivalently, maximizes $t^2(\mathbf{a})$ in (5.13), is (any multiple of)

$$\mathbf{a} = \mathbf{S}_{\mathrm{pl}}^{-1}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2). \tag{5.14}$$

Since $\mathbf{a}$ in (5.14) projects $\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2$ onto a line parallel to the line joining $\bar{\mathbf{y}}_1$ and $\bar{\mathbf{y}}_2$,
we would expect that $t^2(\mathbf{a}) = T^2$, and this is indeed the case (see Problem 5.3).
When $\mathbf{a} = \mathbf{S}_{\mathrm{pl}}^{-1}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)$, then $z = \mathbf{a}'\mathbf{y}$ is called the discriminant function.
Sometimes the vector $\mathbf{a}$ itself in (5.14) is loosely referred to as the discriminant function.

If $H_0: \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2$ is rejected by $T^2$ in (5.9), the discriminant function $\mathbf{a}'\mathbf{y}$ will
lead to rejection of $H_0: \mathbf{a}'\boldsymbol{\mu}_1 = \mathbf{a}'\boldsymbol{\mu}_2$ using (5.13), with $\mathbf{a} = \mathbf{S}_{\mathrm{pl}}^{-1}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)$. We can
then examine each $a_j$ in $\mathbf{a}$ for an indication of the contribution of the corresponding
$y_j$ to rejection of $H_0$. This follow-up examination of each $a_j$ should be done only
if $H_0: \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2$ is rejected by $T^2$. The discriminant function will appear again in
Section 5.6.2 and in Chapters 8 and 9.
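The claim that the direction in (5.14) maximizes $t^2(\mathbf{a})$ and attains $T^2$ can be checked numerically. The following is a minimal sketch in plain Python for $p = 2$; the sample sizes, mean difference, and pooled covariance matrix are made-up values chosen only for illustration, and the helper names are not from the text.

```python
# Hypothetical two-variable inputs (not data from the text).
n1, n2 = 12, 15                       # sample sizes
d = (2.0, -1.0)                       # ybar_1 - ybar_2
S = ((4.0, 1.5), (1.5, 3.0))          # pooled covariance matrix S_pl

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def matvec(m, v):
    return tuple(dot(row, v) for row in m)

def inv2(m):
    """Inverse of a 2 x 2 matrix by the closed-form formula."""
    (a, b), (c, e) = m
    det = a * e - b * c
    return ((e / det, -b / det), (-c / det, a / det))

def t_squared(a):
    """t^2(a) from (5.13) for a coefficient vector a."""
    return dot(a, d) ** 2 / ((n1 + n2) / (n1 * n2) * dot(a, matvec(S, a)))

a_star = matvec(inv2(S), d)                    # direction in (5.14)
T2 = n1 * n2 / (n1 + n2) * dot(d, a_star)      # two-sample T^2

print(t_squared(a_star), T2)                   # the two values agree
print(t_squared((1.0, 0.0)), t_squared((0.0, 1.0)))  # squared univariate t_j
```

The last line uses the unit vectors as $\mathbf{a}$, which gives the squared univariate statistics $t_j^2$ of (5.15); in this sketch both are smaller than $T^2$, as expected when $\mathbf{a}$ is not the maximizing direction.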

We list these and other procedures that could be used to check each variable
following rejection of $H_0$ by a two-sample $T^2$-test:


1.  Univariate $t$-tests, one for each variable,

$$t_j = \frac{\bar{y}_{1j} - \bar{y}_{2j}}{\sqrt{[(n_1 + n_2)/n_1 n_2]\, s_{jj}}}, \qquad j = 1, 2, \ldots, p, \tag{5.15}$$

where $s_{jj}$ is the $j$th diagonal element of $\mathbf{S}_{\mathrm{pl}}$. Reject $H_0: \mu_{1j} = \mu_{2j}$ if
$|t_j| > t_{\alpha/2,\, n_1+n_2-2}$. For confidence intervals on $\mu_{1j} - \mu_{2j}$, see Rencher (1998,
Section 3.6).

2.  To adjust the $\alpha$-level resulting from performing the $p$ tests in (5.15), we could
use a Bonferroni critical value $t_{\alpha/2p,\, n_1+n_2-2}$ for (5.15) (Bonferroni 1936). A
critical value $t_{\alpha/2p}$ is much greater than the corresponding $t_{\alpha/2}$, and the
resulting overall $\alpha$-level is conservative. Bonferroni critical values $t_{\alpha/2p,\nu}$ are given
in Table A.8, from Bailey (1977).

3.  Another critical value that could be used with (5.15) is $T_{\alpha,p,n_1+n_2-2}$, where $T_\alpha$
is the square root of $T^2_\alpha$ from Table A.7; that is, $T_{\alpha,p,n_1+n_2-2} = \sqrt{T^2_{\alpha,p,n_1+n_2-2}}$.
This allows for all $p$ variables to be tested as well as all possible linear
combinations, as in (5.13), even linear combinations chosen after seeing the data.
Consequently, the use of $T_\alpha$ is even more conservative than using $t_{\alpha/2p}$; that
is, $T_{\alpha,p,n_1+n_2-2} > t_{\alpha/2p,\, n_1+n_2-2}$.

4.  Partial $F$- or $t$-tests [test of each variable adjusted for the other variables; see
(5.32) in Section 5.8]

5.  Standardized discriminant function coefficients (see Section 8.5)

6.  Correlations between the variables and the discriminant function (see
Section 8.7.3)

7.  Stepwise  discriminant  analysis  (see  Section  8.9) 


The  first  three  methods  are  univariate  approaches  that  do  not  use  covariances  or 
correlations  among  the  variables  in  the  computation  of  the  test  statistic.  The  last  four 
methods  are  multivariate  in  the  sense  that  the  correlation  structure  is  explicitly  taken 
into  account  in  the  computation. 

Method 6, involving the correlation between each variable and the discriminant
function, is recommended in many texts and software packages. However, Rencher
(1988) has shown that these correlations are proportional to individual $t$- or $F$-tests
(see Section 8.7.3). Thus this method is equivalent to method 1 and is a univariate
rather than a multivariate approach. Method 7 is often used to identify a subset of
important variables or even to rank the variables according to order of entry. But
Rencher and Larson (1980) have shown that stepwise methods have a high risk of
selecting spurious variables, unless the sample size is very large.

We now consider the univariate procedures 1, 2, and 3. The probability of rejecting
one or more of the $p$ univariate tests when $H_0$ is true is called the overall $\alpha$ or
experimentwise error rate. If we do univariate tests only, with no $T^2$-test, then the
tests based on $t_{\alpha/2p}$ and $T_\alpha$ in procedures 2 and 3 are conservative (overall $\alpha$ too
low), and tests based on $t_{\alpha/2}$ in procedure 1 are liberal (overall $\alpha$ too high). However,
when these tests are carried out only after rejection by the $T^2$-test (such tests are
sometimes called protected tests), the experimentwise error rates change. Obviously
the tests will reject less often (under $H_0$) if they are carried out only if $T^2$ rejects.
Thus the tests using $t_{\alpha/2p}$ and $T_\alpha$ become even more conservative, and the test using
$t_{\alpha/2}$ becomes more acceptable.
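To see why unprotected tests at $t_{\alpha/2}$ are liberal and Bonferroni tests are conservative, it helps to work out the idealized case of $p$ independent tests, where the overall $\alpha$ is $1 - (1 - \alpha')^p$ for a per-test level $\alpha'$. The sketch below assumes independence, which real correlated variables do not satisfy, so it is only a rough guide; the function name is illustrative.

```python
alpha = 0.05

def overall_alpha(per_test_alpha, p):
    """Chance of at least one rejection among p independent tests under H0."""
    return 1 - (1 - per_test_alpha) ** p

for p in (3, 6, 9):
    unadjusted = overall_alpha(alpha, p)       # procedure 1, no prior T^2-test
    bonferroni = overall_alpha(alpha / p, p)   # procedure 2, each test at alpha/p
    print(p, round(unadjusted, 3), round(bonferroni, 3))
```

For $p = 3, 6, 9$ the unadjusted rates are about .14, .26, and .37, close to the entries .145, .267, and .348 reported in Table 5.2 for the weakly correlated case (common $r^2 = .10$, sample size 10), while the Bonferroni rates stay just under .05.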

Hummel and Sligo (1971) studied the experimentwise error rate for univariate
$t$-tests following rejection of $H_0$ by the $T^2$-test (protected tests). Using $\alpha = .05$, they
found that using $t_{\alpha/2}$ for a critical value yields an overall $\alpha$ acceptably close to the
nominal .05. In fact, it is slightly conservative, making this the preferred univariate
test (within the limits of their study). They also compared this procedure with that of
performing univariate tests without a prior $T^2$-test (unprotected tests). For this case,
the overall $\alpha$ is too high, as expected. Table 5.2 gives an excerpt of Hummel and
Sligo's results. The sample size is for each of the two samples; the $r^2$ in common is
for every pair of variables.

Hummel and Sligo therefore recommended performing the multivariate $T^2$-test
followed by univariate $t$-tests. This procedure appears to have the desired overall
$\alpha$ level and will clearly have better power than tests using $T_\alpha$ or $t_{\alpha/2p}$ as a critical
value. Table 5.2 also highlights the importance of using univariate $t$-tests only if
the multivariate $T^2$-test is significant. The inflated $\alpha$'s resulting if $t$-tests are used
without regard to the outcome of the $T^2$-test are clearly evident. Thus among the
three univariate procedures (procedures 1, 2, and 3), the first appears to be preferred.

Among the multivariate approaches (procedures 4, 5, and 7), we prefer the fifth
procedure, which compares the (absolute value of) coefficients in the discriminant
function to find the effect of each variable in separating the two groups of
observations. These coefficients will often tell a different story from the univariate tests,
because the univariate tests do not take into account the correlations among the
variables or the effect of each variable on $T^2$ in the presence of the other variables. A
variable will typically have a different effect in the presence of other variables than
it has by itself. In the discriminant function $z = \mathbf{a}'\mathbf{y} = a_1 y_1 + a_2 y_2 + \cdots + a_p y_p$,
where $\mathbf{a} = \mathbf{S}_{\mathrm{pl}}^{-1}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)$, the coefficients $a_1, a_2, \ldots, a_p$ indicate the relative
importance of the variables in a multivariate context, something the univariate $t$-tests
cannot do. If the variables are not commensurate (similar in scale and variance), the
coefficients should be standardized, as in Section 8.5; this allows for more valid
comparisons among the variables. Rencher and Scott (1990) provided a
decomposition of the information in the standardized discriminant function coefficients. For a
detailed analysis of the effect of each variable in the presence of the other variables,
see Rencher (1993; 1998, Sections 3.3.5 and 3.5.3).

Table 5.2. Comparison of Experimentwise Error Rates (Nominal $\alpha = .05$)

                                 Common $r^2$
Sample    Number of
Size      Variables     .10      .30      .50      .70

Univariate Tests Only^a
10        3             .145     .112     .114     .077
10        6             .267     .190     .178     .111
10        9             .348     .247     .209     .129
30        3             .115     .119     .117     .085
30        6             .225     .200     .176     .115
30        9             .296     .263     .223     .140
50        3             .138     .124     .102     .083
50        6             .230     .190     .160     .115
50        9             .324     .258     .208     .146

Multivariate Test Followed by Univariate Tests^b
10        3             .044     .029     .035     .022
10        6             .046     .029     .030     .017
10        9             .050     .026     .025     .018
30        3             .037     .044     .029     .025
30        6             .037     .037     .032     .021
30        9             .042     .042     .030     .021
50        3             .038     .041     .033     .028
50        6             .037     .039     .028     .027
50        9             .036     .038     .026     .020

^a Ignoring multivariate tests.
^b Carried out only if multivariate test rejects.

Example 5.5. For the psychological data in Table 5.1, we obtained $\bar{\mathbf{y}}_1$, $\bar{\mathbf{y}}_2$, and $\mathbf{S}_{\mathrm{pl}}$
in Example 5.4.2. The discriminant function coefficient vector is obtained from
(5.14) as

$$\mathbf{a} = \mathbf{S}_{\mathrm{pl}}^{-1}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2) = \begin{pmatrix} .5104 \\ -.2033 \\ .4660 \\ -.3097 \end{pmatrix}.$$

Thus the linear combination that best separates the two groups is

$$\mathbf{a}'\mathbf{y} = .5104 y_1 - .2033 y_2 + .4660 y_3 - .3097 y_4,$$

in which $y_1$ and $y_3$ appear to contribute most to separation of the two groups. (After
standardization, the relative contribution of the variables changes somewhat; see the
answer to Problem 8.7 in Appendix B.)  □

5.6  COMPUTATION OF $T^2$

If one has a program available with matrix manipulation capability, it is a simple
matter to compute $T^2$ using (5.9). However, this approach is somewhat cumbersome for
those not accustomed to the use of such a programming language, and many would
prefer a more automated procedure. But very few general-purpose statistical
programs provide for direct calculation of the two-sample $T^2$-statistic, perhaps because
it is so easy to obtain from other procedures. We will discuss two types of widely
available procedures that can be used to compute $T^2$.

5.6.1  Obtaining $T^2$ from a MANOVA Program

Multivariate analysis of variance (MANOVA) is discussed in Chapter 6, and the
reader may wish to return to the present section after becoming familiar with that
material. One-way MANOVA involves a comparison of mean vectors from several
samples. Typically, the number of samples is three or more, but the procedure will
also accommodate two samples. The two-sample $T^2$-test is thus a special case of
MANOVA.

Four common test statistics are defined in Section 6.1: Wilks' $\Lambda$, the
Lawley-Hotelling $U^{(s)}$, Pillai's $V^{(s)}$, and Roy's largest root $\theta$. Without concerning ourselves
here with how these are defined or calculated, we show how to use each to obtain the
two-sample $T^2$:

$$T^2 = (n_1 + n_2 - 2)\,\frac{1 - \Lambda}{\Lambda}, \tag{5.16}$$

$$T^2 = (n_1 + n_2 - 2)\,U^{(s)}, \tag{5.17}$$

$$T^2 = (n_1 + n_2 - 2)\,\frac{V^{(s)}}{1 - V^{(s)}}, \tag{5.18}$$

$$T^2 = (n_1 + n_2 - 2)\,\frac{\theta}{1 - \theta}. \tag{5.19}$$

(For the special case of two groups, $V^{(s)} = \theta$.) These relationships are demonstrated
in Section 6.1.7. If the MANOVA program gives eigenvectors of $\mathbf{E}^{-1}\mathbf{H}$ ($\mathbf{E}$ and $\mathbf{H}$
are defined in Section 6.1.2), the eigenvector corresponding to the largest eigenvalue
will be equal to (a constant multiple of) the discriminant function $\mathbf{S}_{\mathrm{pl}}^{-1}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)$.
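The four conversions (5.16) to (5.19) are one-line formulas, so a small set of helper functions is enough to turn MANOVA output into $T^2$. The sketch below is illustrative only: the function names are hypothetical, and the MANOVA statistics are generated by inverting the formulas for an assumed $T^2$ (the value 97.6015 quoted for Example 5.4.2 with $n_1 = n_2 = 32$) rather than taken from real program output.

```python
def t2_from_wilks(lam, nu):
    """(5.16): T^2 from Wilks' Lambda, with nu = n1 + n2 - 2."""
    return nu * (1 - lam) / lam

def t2_from_lawley_hotelling(u, nu):
    """(5.17): T^2 from the Lawley-Hotelling statistic U."""
    return nu * u

def t2_from_pillai(v, nu):
    """(5.18): T^2 from Pillai's V."""
    return nu * v / (1 - v)

def t2_from_roy(theta, nu):
    """(5.19): T^2 from Roy's largest root theta."""
    return nu * theta / (1 - theta)

# Round-trip check: invert each formula for an assumed T^2, then convert back.
nu = 32 + 32 - 2
T2 = 97.6015
lam = nu / (nu + T2)            # inverse of (5.16)
u = T2 / nu                     # inverse of (5.17)
v = theta = T2 / (nu + T2)      # inverses of (5.18) and (5.19); V = theta here

for value in (t2_from_wilks(lam, nu), t2_from_lawley_hotelling(u, nu),
              t2_from_pillai(v, nu), t2_from_roy(theta, nu)):
    print(round(value, 4))
```

Note that in the two-group case $\Lambda = 1 - \theta$, which is why the same $T^2$ comes back from all four statistics.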

5.6.2  Obtaining $T^2$ from Multiple Regression

In this section, the $y$'s become independent variables in a regression model. For each
observation vector $\mathbf{y}_{1i}$ and $\mathbf{y}_{2i}$ in a two-sample $T^2$, define a "dummy" group variable
as

$$w_i = \begin{cases} \dfrac{n_2}{n_1 + n_2} & \text{for each of } \mathbf{y}_{11}, \mathbf{y}_{12}, \ldots, \mathbf{y}_{1n_1} \text{ in sample 1} \\[2ex] -\dfrac{n_1}{n_1 + n_2} & \text{for each of } \mathbf{y}_{21}, \mathbf{y}_{22}, \ldots, \mathbf{y}_{2n_2} \text{ in sample 2.} \end{cases}$$

Then $\bar{w} = 0$ over all $n_1 + n_2$ observations. The prediction equation for the regression
of $w$ on the $y$'s can be written as

$$\hat{w}_i = b_0 + b_1 y_{i1} + b_2 y_{i2} + \cdots + b_p y_{ip},$$

where $i$ ranges over all $n_1 + n_2$ observations and the least squares estimate $b_0$ is [see
(10.15)]

$$b_0 = \bar{w} - b_1\bar{y}_1 - b_2\bar{y}_2 - \cdots - b_p\bar{y}_p.$$

Substituting this into the regression equation, we obtain

$$\begin{aligned} \hat{w}_i &= \bar{w} + b_1(y_{i1} - \bar{y}_1) + b_2(y_{i2} - \bar{y}_2) + \cdots + b_p(y_{ip} - \bar{y}_p) \\ &= b_1(y_{i1} - \bar{y}_1) + b_2(y_{i2} - \bar{y}_2) + \cdots + b_p(y_{ip} - \bar{y}_p) \quad (\text{since } \bar{w} = 0). \end{aligned}$$

Let $\mathbf{b}' = (b_1, b_2, \ldots, b_p)$ be the vector of regression coefficients and $R^2$ be the
squared multiple correlation. Then we have the following relationships:

$$T^2 = (n_1 + n_2 - 2)\,\frac{R^2}{1 - R^2}, \tag{5.20}$$

$$\mathbf{a} = \mathbf{S}_{\mathrm{pl}}^{-1}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2) = \frac{n_1 + n_2}{n_1 n_2}\,(n_1 + n_2 - 2 + T^2)\,\mathbf{b}. \tag{5.21}$$

Thus with ordinary multiple regression, one can easily obtain $T^2$ and the
discriminant function $\mathbf{S}_{\mathrm{pl}}^{-1}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)$. We simply define $w_i$ as above for each of the $n_1 + n_2$
observations, regress the $w$'s on the $y$'s, and use the resulting $R^2$ in (5.20). For $\mathbf{b}$,
delete the intercept from the regression coefficients for use in (5.21). Actually, since
only the relative values of the elements of $\mathbf{a} = \mathbf{S}_{\mathrm{pl}}^{-1}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)$ are of interest, it is not
necessary to convert from $\mathbf{b}$ to $\mathbf{a}$ in (5.21). We can use $\mathbf{b}$ directly or standardize the
values $b_1, b_2, \ldots, b_p$ as in Section 8.5.

Example 5.6.2. We illustrate the regression approach to computation of $T^2$ using
the psychological data in Table 5.1. We set $w = n_2/(n_1 + n_2) = \frac{32}{64} = \frac{1}{2}$ for each
observation in group 1 (males) and equal to $-n_1/(n_1 + n_2) = -\frac{1}{2}$ in the second
group (females). When $w$ is regressed on the 64 $\mathbf{y}$'s, we obtain

$$\begin{pmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \\ b_4 \end{pmatrix} = \begin{pmatrix} -.751 \\ .051 \\ -.020 \\ .047 \\ -.031 \end{pmatrix}, \qquad R^2 = .6115.$$

By (5.20),

$$T^2 = (n_1 + n_2 - 2)\,\frac{R^2}{1 - R^2} = \frac{62(.6115)}{1 - .6115} = 97.601,$$

as was obtained before in Example 5.4.2. Note that $\mathbf{b}' = (b_1, b_2, b_3, b_4) =
(.051, -.020, .047, -.031)$, with the intercept deleted, is proportional to the
discriminant function coefficient vector $\mathbf{a}$ from Example 5.5, as we would expect from
(5.21).  □
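The arithmetic of Example 5.6.2 can be retraced in a few lines. This sketch applies (5.20) and (5.21) to the rounded values of $R^2$ and $\mathbf{b}$ reported in the example, so the results agree with the text only up to rounding; the variable names are illustrative.

```python
n1 = n2 = 32
nu = n1 + n2 - 2
R2 = 0.6115                               # rounded R^2 from the example
b = (0.051, -0.020, 0.047, -0.031)        # rounded slopes, intercept deleted

T2 = nu * R2 / (1 - R2)                        # (5.20)
scale = (n1 + n2) / (n1 * n2) * (nu + T2)      # multiplier of b in (5.21)
a = tuple(scale * bj for bj in b)              # approximate discriminant vector

print(round(T2, 2))
print([round(aj, 2) for aj in a])
```

With these rounded inputs, $T^2$ comes out near 97.6 and $\mathbf{a}$ is close to the vector $(.5104, -.2033, .4660, -.3097)'$ of Example 5.5; the small discrepancies are due to the three-decimal rounding of $\mathbf{b}$ and the four-decimal rounding of $R^2$.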


5.7  PAIRED  OBSERVATIONS  TEST 

As usual, we begin with the univariate case to set the stage for the multivariate
presentation.

5.7.1  Univariate  Case 

Suppose two samples are not independent because there exists a natural pairing
between the $i$th observation $y_i$ in the first sample and the $i$th observation $x_i$ in the
second sample for all $i$, as, for example, when a treatment is applied twice to the
same individual or when subjects are matched according to some criterion, such as
IQ or family background. With such pairing, the samples are often referred to as
paired observations or matched pairs. The two samples thus obtained are correlated,
and the two-sample test statistic in (5.8) is not appropriate because the samples must
be independent in order for (5.8) to have a $t$-distribution. [The two-sample test in
(5.8) is somewhat robust to heterogeneity of variances and to lack of normality but
not to dependence.] We reduce the two samples to one by working with the
differences between the paired observations, as in the following layout for two treatments
applied to the same subject:

                                                Difference
Pair Number    Treatment 1    Treatment 2    $d_i = y_i - x_i$
1              $y_1$          $x_1$          $d_1$
2              $y_2$          $x_2$          $d_2$
⋮              ⋮              ⋮              ⋮
n              $y_n$          $x_n$          $d_n$

To obtain a $t$-test, it is not sufficient to assume individual normality for each of $y$
and $x$. To allow for the covariance between $y$ and $x$, we need the additional
assumption that $y$ and $x$ have a bivariate normal distribution with

$$\boldsymbol{\mu} = \begin{pmatrix} \mu_y \\ \mu_x \end{pmatrix}, \qquad \boldsymbol{\Sigma} = \begin{pmatrix} \sigma_y^2 & \sigma_{yx} \\ \sigma_{yx} & \sigma_x^2 \end{pmatrix}.$$

It then follows by property 1a in Section 4.2 that $d_i = y_i - x_i$ is $N(\mu_y - \mu_x, \sigma_d^2)$,
where $\sigma_d^2 = \sigma_y^2 - 2\sigma_{yx} + \sigma_x^2$. From $d_1, d_2, \ldots, d_n$ we calculate

$$\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i \quad \text{and} \quad s_d^2 = \frac{1}{n-1}\sum_{i=1}^{n} (d_i - \bar{d})^2.$$

To test $H_0: \mu_y = \mu_x$, that is, $H_0: \mu_d = 0$, we use the one-sample statistic

$$t = \frac{\bar{d}}{s_d/\sqrt{n}}, \tag{5.22}$$

which is distributed as $t_{n-1}$ if $H_0$ is true. We reject $H_0$ in favor of $H_1: \mu_d \neq 0$ if
$|t| > t_{\alpha/2,\, n-1}$. It is not necessary to assume $\sigma_y^2 = \sigma_x^2$ because there are no restrictions
on $\boldsymbol{\Sigma}$.
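The reduction to differences can be traced numerically. The sketch below, using made-up paired measurements, computes the paired $t$ of (5.22) and confirms the sample analogue of $\sigma_d^2 = \sigma_y^2 - 2\sigma_{yx} + \sigma_x^2$, namely $s_d^2 = s_y^2 - 2s_{yx} + s_x^2$. The data and helper names are assumptions for illustration only.

```python
from math import sqrt

# Hypothetical paired measurements on n = 6 subjects (not from the text).
y = [12.1, 10.4, 13.3, 11.8, 12.9, 10.7]   # treatment 1
x = [11.2, 10.1, 12.0, 11.9, 11.6, 10.0]   # treatment 2
n = len(y)

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    """Sample covariance with divisor n - 1; cov(u, u) is the sample variance."""
    ub, vb = mean(u), mean(v)
    return sum((ui - ub) * (vi - vb) for ui, vi in zip(u, v)) / (len(u) - 1)

d = [yi - xi for yi, xi in zip(y, x)]      # differences d_i = y_i - x_i
s2_d = cov(d, d)                           # s_d^2 computed from the differences
t_paired = mean(d) / sqrt(s2_d / n)        # (5.22), with n - 1 degrees of freedom

# The same variance assembled from the two samples and their covariance.
s2_from_parts = cov(y, y) - 2 * cov(y, x) + cov(x, x)

print(t_paired)
print(s2_d, s2_from_parts)                 # equal up to floating-point error
```

Because the covariance term enters with a negative sign, positively correlated pairs give a smaller $s_d^2$ than the sum of the two sample variances, which is the source of the gain in power from pairing.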

This test has only $n - 1$ degrees of freedom compared with $2(n - 1)$ for the
two-independent-sample $t$-test (5.8). In general, the pairing reduces the within-sample
variation $s_d$ and thereby increases the power.

If we mistakenly treated the two samples as independent and used (5.8) with $n_1 =
n_2 = n$, we would have

$$t = \frac{\bar{y} - \bar{x}}{s_{\mathrm{pl}}\sqrt{2/n}} = \frac{\bar{y} - \bar{x}}{\sqrt{2 s_{\mathrm{pl}}^2/n}}.$$

However,

$$E\left(\frac{2 s_{\mathrm{pl}}^2}{n}\right) = 2E\left[\frac{(n-1)s_y^2 + (n-1)s_x^2}{(n + n - 2)\,n}\right] = \frac{\sigma_y^2 + \sigma_x^2}{n},$$

whereas $\mathrm{var}(\bar{y} - \bar{x}) = (\sigma_y^2 + \sigma_x^2 - 2\sigma_{yx})/n$. Thus if the test statistic for independent
samples (5.8) is used for paired data, it does not have a $t$-distribution and, in fact,
underestimates the true average $t$-value (assuming $H_0$ is false), since $\sigma_y^2 + \sigma_x^2 >
\sigma_y^2 + \sigma_x^2 - 2\sigma_{yx}$ if $\sigma_{yx} > 0$, which would be typical in this situation. One could
therefore use

$$t = \frac{\bar{y} - \bar{x}}{\sqrt{(s_y^2 + s_x^2 - 2s_{yx})/n}}, \tag{5.23}$$

but $t = \sqrt{n}\,\bar{d}/s_d$ in (5.22) is equal to it and somewhat simpler to use.

5.7.2  Multivariate Case

Here we assume the same natural pairing of sampling units as in the univariate case,
but we measure $p$ variables on each sampling unit. Thus $\mathbf{y}_i$ from the first sample is
paired with $\mathbf{x}_i$ from the second sample, $i = 1, 2, \ldots, n$. In terms of two treatments
applied to each sampling unit, this situation is as follows:

                                                Difference
Pair Number    Treatment 1    Treatment 2    $\mathbf{d}_i = \mathbf{y}_i - \mathbf{x}_i$
1              $\mathbf{y}_1$          $\mathbf{x}_1$          $\mathbf{d}_1$
2              $\mathbf{y}_2$          $\mathbf{x}_2$          $\mathbf{d}_2$
⋮              ⋮              ⋮              ⋮
n              $\mathbf{y}_n$          $\mathbf{x}_n$          $\mathbf{d}_n$

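In Section 5.7.1, we made the assumption that $y$ and $x$ have a bivariate normal
distribution, in which $y$ and $x$ are correlated. Here we assume $\mathbf{y}$ and $\mathbf{x}$ are correlated
and have a multivariate normal distribution:

$$\begin{pmatrix} \mathbf{y} \\ \mathbf{x} \end{pmatrix} \text{ is } N_{2p}\left[\begin{pmatrix} \boldsymbol{\mu}_y \\ \boldsymbol{\mu}_x \end{pmatrix}, \begin{pmatrix} \boldsymbol{\Sigma}_{yy} & \boldsymbol{\Sigma}_{yx} \\ \boldsymbol{\Sigma}_{xy} & \boldsymbol{\Sigma}_{xx} \end{pmatrix}\right].$$

To test $H_0: \boldsymbol{\mu}_d = \mathbf{0}$, which is equivalent to $H_0: \boldsymbol{\mu}_y = \boldsymbol{\mu}_x$ since $\boldsymbol{\mu}_d = E(\mathbf{y} - \mathbf{x}) =
\boldsymbol{\mu}_y - \boldsymbol{\mu}_x$, we calculate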

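$$\bar{\mathbf{d}} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{d}_i \quad \text{and} \quad \mathbf{S}_d = \frac{1}{n-1}\sum_{i=1}^{n} (\mathbf{d}_i - \bar{\mathbf{d}})(\mathbf{d}_i - \bar{\mathbf{d}})'.$$

We then have

$$T^2 = \bar{\mathbf{d}}'\left(\frac{\mathbf{S}_d}{n}\right)^{-1}\bar{\mathbf{d}} = n\,\bar{\mathbf{d}}'\mathbf{S}_d^{-1}\bar{\mathbf{d}}. \tag{5.24}$$

Under $H_0$, this paired comparison $T^2$-statistic is distributed as $T^2_{p,n-1}$. We reject $H_0$
if $T^2 > T^2_{\alpha,p,n-1}$. Note that $\mathbf{S}_d$ estimates $\mathrm{cov}(\mathbf{y} - \mathbf{x}) = \boldsymbol{\Sigma}_{yy} - \boldsymbol{\Sigma}_{yx} - \boldsymbol{\Sigma}_{xy} + \boldsymbol{\Sigma}_{xx}$, for
which an equivalent estimator would be $\mathbf{S}_{yy} - \mathbf{S}_{yx} - \mathbf{S}_{xy} + \mathbf{S}_{xx}$ [see (3.42)].

The cautions expressed in Section 5.7.1 for univariate paired observation data also
apply here. If the two samples of multivariate observations are correlated because of
a natural pairing of sampling units, the test in (5.24) should be used rather than the
two-sample $T^2$-test in (5.9), which assumes two independent samples. Misuse of
(5.9) in place of (5.24) will lead to loss of power.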

Since the assumption $\boldsymbol{\Sigma}_{yy} = \boldsymbol{\Sigma}_{xx}$ is not needed for (5.24) to have a
$T^2$-distribution, this test can be used for independent samples when $\boldsymbol{\Sigma}_1 \neq \boldsymbol{\Sigma}_2$ (as
long as $n_1 = n_2$). The observations in the two samples would be paired in the order
they were obtained or in an arbitrary order. However, in the case of independent
samples, the pairing achieves no gain in power to offset the loss of $n - 1$ degrees of
freedom.

By analogy with (5.14), the discriminant function coefficient vector for paired
observation data becomes

$$\mathbf{a} = \mathbf{S}_d^{-1}\bar{\mathbf{d}}. \tag{5.25}$$

For tests on individual variables, we have

$$t_j = \frac{\bar{d}_j}{\sqrt{s_{d,jj}/n}}, \qquad j = 1, 2, \ldots, p. \tag{5.26}$$

The critical value for $t_j$ is $t_{\alpha/2p,\, n-1}$ or $t_{\alpha/2,\, n-1}$ depending on whether a $T^2$-test is
carried out first (see Section 5.5).


Example 5.7.2. To compare two types of coating for resistance to corrosion, 15
pieces of pipe were coated with each type of coating (Kramer and Jensen 1969b).
Two pipes, one with each type of coating, were buried together and left for the same
length of time at 15 different locations, providing a natural pairing of the
observations. Corrosion for the first type of coating was measured by two variables,

    $y_1$ = maximum depth of pit in thousandths of an inch,
    $y_2$ = number of pits,

with $x_1$ and $x_2$ defined analogously for the second coating. The data and differences
are given in Table 5.3.

Table 5.3. Depth of Maximum Pits and Number of Pits of Coated Pipes

              Coating 1          Coating 2          Difference
            Depth   Number     Depth   Number     Depth   Number
Location    $y_1$   $y_2$      $x_1$   $x_2$      $d_1$   $d_2$
 1           73      31         51      35         22      -4
 2           43      19         41      14          2       5
 3           47      22         43      19          4       3
 4           53      26         41      29         12      -3
 5           58      36         47      34         11       2
 6           47      30         32      26         15       4
 7           52      29         24      19         28      10
 8           38      36         43      37         -5      -1
 9           61      34         53      24          8      10
10           56      33         52      27          4       6

Thus we have, for example, $\mathbf{y}_1' = (73, 31)$, $\mathbf{x}_1' = (51, 35)$, and
$\mathbf{d}_1' = \mathbf{y}_1' - \mathbf{x}_1' = (22, -4)$. For the 15 difference vectors, we obtain

$$\bar{\mathbf{d}} = \begin{pmatrix} 8.000 \\ 3.067 \end{pmatrix}, \qquad \mathbf{S}_d = \begin{pmatrix} 121.571 & 17.071 \\ 17.071 & 21.781 \end{pmatrix}.$$

By (5.24),

$$T^2 = (15)(8.000,\ 3.067)\begin{pmatrix} 121.571 & 17.071 \\ 17.071 & 21.781 \end{pmatrix}^{-1}\begin{pmatrix} 8.000 \\ 3.067 \end{pmatrix} = 10.819.$$

Since $T^2 = 10.819 > T^2_{.05,2,14} = 8.197$, we reject $H_0: \boldsymbol{\mu}_d = \mathbf{0}$ and conclude that
the two coatings differ in their effect on corrosion.  □
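The computation in Example 5.7.2 is small enough to reproduce by hand or in a few lines of code. The sketch below starts from the summary statistics $\bar{\mathbf{d}}$ and $\mathbf{S}_d$ reported in the example (not from the raw data) and evaluates (5.24), (5.25), and (5.26) with the closed-form $2 \times 2$ inverse; variable names are illustrative.

```python
from math import sqrt

n = 15
dbar = (8.000, 3.067)                               # mean difference vector
Sd = ((121.571, 17.071), (17.071, 21.781))          # covariance of differences

(s11, s12), (s21, s22) = Sd
det = s11 * s22 - s12 * s21

# a = Sd^{-1} dbar, the paired discriminant vector of (5.25)
a = ((s22 * dbar[0] - s12 * dbar[1]) / det,
     (-s21 * dbar[0] + s11 * dbar[1]) / det)

T2 = n * (dbar[0] * a[0] + dbar[1] * a[1])          # (5.24): n * dbar' Sd^{-1} dbar

# Univariate follow-up statistics from (5.26)
t = tuple(dbar[j] / sqrt(Sd[j][j] / n) for j in range(2))

print(round(T2, 2))
print(a, t)
```

Computed from these rounded inputs, $T^2$ is about 10.82, matching the reported 10.819 to within rounding.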


5.8  TEST  FOR  ADDITIONAL  INFORMATION 


In this section, we are again considering two independent samples, as in
Section 5.4.2. We start with a basic $p \times 1$ vector $\mathbf{y}$ of measurements on each sampling
unit and ask whether a $q \times 1$ subvector $\mathbf{x}$ measured in addition to $\mathbf{y}$ (on the same
unit) will significantly increase the separation of the two samples as shown by $T^2$.
It is not necessary that we add new variables. We may be interested in determining
whether some of the variables we already have are redundant in the presence of other
variables in terms of separating the groups. We have designated the subset of interest
by $\mathbf{x}$ for notational convenience.

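It is assumed that the two samples are from multivariate normal populations with
a common covariance matrix; that is,

$$\begin{pmatrix} \mathbf{y}_{11} \\ \mathbf{x}_{11} \end{pmatrix}, \begin{pmatrix} \mathbf{y}_{12} \\ \mathbf{x}_{12} \end{pmatrix}, \ldots, \begin{pmatrix} \mathbf{y}_{1n_1} \\ \mathbf{x}_{1n_1} \end{pmatrix} \text{ are from } N_{p+q}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}),$$

$$\begin{pmatrix} \mathbf{y}_{21} \\ \mathbf{x}_{21} \end{pmatrix}, \begin{pmatrix} \mathbf{y}_{22} \\ \mathbf{x}_{22} \end{pmatrix}, \ldots, \begin{pmatrix} \mathbf{y}_{2n_2} \\ \mathbf{x}_{2n_2} \end{pmatrix} \text{ are from } N_{p+q}(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}),$$

where $\boldsymbol{\mu}_1$, $\boldsymbol{\mu}_2$, and $\boldsymbol{\Sigma}$ are partitioned to conform with $\mathbf{y}$ and $\mathbf{x}$.

We partition the sample mean vectors and covariance matrix accordingly:

$$\begin{pmatrix} \bar{\mathbf{y}}_1 \\ \bar{\mathbf{x}}_1 \end{pmatrix}, \qquad \begin{pmatrix} \bar{\mathbf{y}}_2 \\ \bar{\mathbf{x}}_2 \end{pmatrix}, \qquad \mathbf{S}_{\mathrm{pl}} = \begin{pmatrix} \mathbf{S}_{yy} & \mathbf{S}_{yx} \\ \mathbf{S}_{xy} & \mathbf{S}_{xx} \end{pmatrix},$$
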
where $\mathbf{S}_{\mathrm{pl}}$ is the pooled sample covariance matrix from the two samples.
We wish to test the hypothesis that $\mathbf{x}_1$ and $\mathbf{x}_2$ are redundant for separating the
two groups, that is, that the extra $q$ variables do not contribute anything significant
beyond the information already available in $\mathbf{y}_1$ and $\mathbf{y}_2$ for separating the groups.
This is in the spirit of a full and reduced model test in regression [see (5.31) and
Section 10.2.5b]. However, here we are working with a subset of dependent variables
as contrasted to the subset of independent variables in the regression setting. Thus
both $\mathbf{y}$ and $\mathbf{x}$ are subvectors of dependent variables. In this setting, the independent
variables would be grouping variables 1 and 2 corresponding to $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_2$.

We are not asking if the $x$'s can significantly separate the two groups by
themselves, but whether they provide additional separation beyond the separation already
achieved by the $y$'s. If the $x$'s were independent of the $y$'s, we would have $T^2_{p+q} =
T^2_p + T^2_q$, but this does not hold, because they are correlated. We must compare $T^2_{p+q}$
for the full set of variables $(y_1, \ldots, y_p, x_1, \ldots, x_q)$ with $T^2_p$ based on the reduced
set $(y_1, \ldots, y_p)$. We are inquiring if the increase from $T^2_p$ to $T^2_{p+q}$ is significant.

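By definition, the $T^2$-statistic based on the full set of $p + q$ variables is given by

$$T^2_{p+q} = \frac{n_1 n_2}{n_1 + n_2}\left[\begin{pmatrix} \bar{\mathbf{y}}_1 \\ \bar{\mathbf{x}}_1 \end{pmatrix} - \begin{pmatrix} \bar{\mathbf{y}}_2 \\ \bar{\mathbf{x}}_2 \end{pmatrix}\right]' \mathbf{S}_{\mathrm{pl}}^{-1} \left[\begin{pmatrix} \bar{\mathbf{y}}_1 \\ \bar{\mathbf{x}}_1 \end{pmatrix} - \begin{pmatrix} \bar{\mathbf{y}}_2 \\ \bar{\mathbf{x}}_2 \end{pmatrix}\right], \tag{5.27}$$

whereas $T^2$ for the reduced set of $p$ variables is

$$T^2_p = \frac{n_1 n_2}{n_1 + n_2}\,(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)'\,\mathbf{S}_{yy}^{-1}\,(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2). \tag{5.28}$$

Then the test statistic for the significance of the increase from $T^2_p$ to $T^2_{p+q}$ is given
by

$$T^2(\mathbf{x}|\mathbf{y}) = (\nu - p)\,\frac{T^2_{p+q} - T^2_p}{\nu + T^2_p}, \tag{5.29}$$

which is distributed as $T^2_{q,\nu-p}$. We reject the hypothesis of redundancy of $\mathbf{x}$ if
$T^2(\mathbf{x}|\mathbf{y}) > T^2_{\alpha,q,\nu-p}$.

By (5.7), $T^2(\mathbf{x}|\mathbf{y})$ can be converted to an $F$-statistic:

$$F = \frac{\nu - p - q + 1}{q}\,\frac{T^2_{p+q} - T^2_p}{\nu + T^2_p}, \tag{5.30}$$

which is distributed as $F_{q,\nu-p-q+1}$, and we reject the hypothesis if $F > F_{\alpha,q,\nu-p-q+1}$.

In both cases $\nu = n_1 + n_2 - 2$. Note that the first degrees-of-freedom parameter
in both (5.29) and (5.30) is $q$, the number of $x$'s. The second parameter in (5.29) is
$\nu - p$ because the statistic is adjusted for the $p$ variables in $\mathbf{y}$.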

To prove directly that the statistic defined in (5.30) has an $F$-distribution, we can
use a basic relationship from multiple regression [see (10.33)]:

$$F_{q,\nu-p-q+1} = \frac{(R^2_{p+q} - R^2_p)(\nu - p - q + 1)}{(1 - R^2_{p+q})\,q}, \tag{5.31}$$

where $R^2_{p+q}$ is the squared multiple correlation from the full model with $p + q$
independent variables and $R^2_p$ is from the reduced model with $p$ independent variables. If
we solve for $R^2$ in terms of $T^2$ from (5.20) and substitute this into (5.31), we readily
obtain the test statistic in (5.30).

If we are interested in the effect of adding a single $x$, then $q = 1$, and both (5.29)
and (5.30) reduce to

$$t^2(x|\mathbf{y}) = (\nu - p)\,\frac{T^2_{p+1} - T^2_p}{\nu + T^2_p}, \tag{5.32}$$

and we reject the hypothesis of redundancy of $x$ if $t^2(x|\mathbf{y}) > t^2_{\alpha/2,\nu-p} = F_{\alpha,1,\nu-p}$.


Example 5.8. We use the psychological data of Table 5.1 to illustrate tests on
subvectors. We begin by testing the significance of $y_3$ and $y_4$ above and beyond $y_1$ and
$y_2$. (In the notation of the present section, $y_3$ and $y_4$ become $x_1$ and $x_2$.) For these
subvectors, $p = 2$ and $q = 2$. The value of $T^2_{p+q}$ for all four variables as given
by (5.27) was obtained in Example 5.4.2 as 97.6015. For $y_1$ and $y_2$, we obtain, by
(5.28),

$$\begin{aligned} T^2_p &= \frac{n_1 n_2}{n_1 + n_2}\,(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)'\,\mathbf{S}_{yy}^{-1}\,(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2) \\ &= \frac{(32)^2}{32 + 32}\begin{pmatrix} 15.97 - 12.34 \\ 15.91 - 13.91 \end{pmatrix}'\begin{pmatrix} 7.16 & 6.05 \\ 6.05 & 15.89 \end{pmatrix}^{-1}\begin{pmatrix} 15.97 - 12.34 \\ 15.91 - 13.91 \end{pmatrix} \\ &= 31.0126. \end{aligned}$$

By (5.29), the test statistic is

$$T^2(y_3, y_4 | y_1, y_2) = (\nu - p)\,\frac{T^2_{p+q} - T^2_p}{\nu + T^2_p} = (62 - 2)\,\frac{97.6015 - 31.0126}{62 + 31.0126} = 42.955.$$

We reject the hypothesis that $\mathbf{x} = (y_3, y_4)'$ is redundant, since $42.955 > T^2_{.01,2,60} =
10.137$. We conclude that $\mathbf{x} = (y_3, y_4)'$ adds a significant amount of separation to
$\mathbf{y} = (y_1, y_2)'$.

To test the effect of each variable adjusted for the other three, we use (5.32). In
this case, $p = 3$, $\nu = 62$, and $\nu - p = 59$. The results are given below, where
$T^2_{p+1} = 97.6015$ and $T^2_p$ in each case is based on the three variables, excluding the
variable in question. For example, $T^2_p = 90.8348$ for $y_2$ is based on $y_1$, $y_3$, and $y_4$,
and $t^2(y_2 | y_1, y_3, y_4) = 2.612$:

Variable    $T^2_p$     $(\nu - p)\,\dfrac{T^2_{p+1} - T^2_p}{\nu + T^2_p}$
$y_1$       78.8733      7.844
$y_2$       90.8348      2.612
$y_3$       32.6253     40.513
$y_4$       74.5926      9.938

When we compare these four test statistic values with the critical value $t^2_{.025,59} =
4.002$, we see that each variable except $y_2$ makes a significant contribution to $T^2$.
Note that $y_3$ contributes most, followed by $y_4$ and then $y_1$. This order differs from
that given by the raw discriminant function in Example 5.5 but agrees with the order
for the standardized discriminant function given in the answer to Problem 8.7 in
Appendix B.  □
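The statistics in Example 5.8 follow directly from (5.29), (5.30), and (5.32) once the full and reduced $T^2$ values are known. The following sketch takes those $T^2$ values from the example as given and recomputes the test statistics; the function names are hypothetical.

```python
def t2_additional(t2_full, t2_reduced, nu, p):
    """(5.29): T^2(x|y) for the extra variables, adjusted for the p y's."""
    return (nu - p) * (t2_full - t2_reduced) / (nu + t2_reduced)

def f_additional(t2_full, t2_reduced, nu, p, q):
    """(5.30): the F form of the same test, with q and nu - p - q + 1 df."""
    return (nu - p - q + 1) / q * (t2_full - t2_reduced) / (nu + t2_reduced)

nu = 62                      # n1 + n2 - 2 with n1 = n2 = 32
t2_full = 97.6015            # T^2 for all four variables

# y3 and y4 adjusted for y1 and y2 (p = 2, q = 2)
print(round(t2_additional(t2_full, 31.0126, nu, p=2), 3))     # 42.955
print(f_additional(t2_full, 31.0126, nu, p=2, q=2))

# Each variable adjusted for the other three (p = 3, q = 1), as in (5.32)
reduced = {"y1": 78.8733, "y2": 90.8348, "y3": 32.6253, "y4": 74.5926}
partial = {name: t2_additional(t2_full, t2r, nu, p=3)
           for name, t2r in reduced.items()}
for name in sorted(partial, key=partial.get, reverse=True):
    print(name, round(partial[name], 3))
```

Sorting by the partial statistic lists $y_3$, $y_4$, $y_1$, $y_2$, the same ordering noted at the end of the example.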


5.9  PROFILE  ANALYSIS 

If $\mathbf{y}$ is $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ and the variables in $\mathbf{y}$ are commensurate (measured in the same
units and with approximately equal variances as, for example, in the probe word
data in Table 3.5), we may wish to compare the means $\mu_1, \mu_2, \ldots, \mu_p$ in $\boldsymbol{\mu}$. This
might be of interest when a measurement is taken on the same research unit at $p$
successive times. Such situations are often referred to as repeated measures designs
or growth curves, which are discussed in some generality in Sections 6.9 and 6.10. In
the present section, we discuss one- and two-sample profile analysis. Profile analysis
for several samples is covered in Section 6.8.

The pattern obtained by plotting $\mu_1, \mu_2, \ldots, \mu_p$ as ordinates and connecting the
points is called a profile; we usually draw straight lines connecting the points $(1, \mu_1)$,
$(2, \mu_2), \ldots, (p, \mu_p)$. Profile analysis is an analysis of the profile or a comparison of
two or more profiles. Profile analysis is often discussed in the context of
administering a battery of $p$ psychological or other tests.

In growth curve analysis, where the variables are measured at time intervals, the
responses have a natural order. In profile analysis where the variables arise from test
scores, there is ordinarily no natural order. A distinction is not always made between
repeated measures of the same variable through time and profile analysis of several
different commensurate variables on the same individual.


5.9.1  One-Sample  Profile  Analysis 

We begin with a discussion of the profile of the mean vector $\boldsymbol{\mu}$ from a
single sample. A plot of $\boldsymbol{\mu}$ might appear as in Figure 5.3, where we plot $(1, \mu_1)$,
$(2, \mu_2), \ldots, (p, \mu_p)$ and connect the points.

[Figure 5.3. Profile of a mean vector. Vertical axis: means; horizontal axis: variables 1 to 4.]


In order to compare the means $\mu_1, \mu_2, \ldots, \mu_p$ in $\boldsymbol{\mu}$, the basic hypothesis is that
the profile is level or flat:

\[ H_0: \mu_1 = \mu_2 = \cdots = \mu_p \quad \text{vs.} \quad H_1: \mu_j \ne \mu_k \ \text{for some } j \ne k. \]

The data matrix $\mathbf{Y}$ is given in (3.17). We cannot use univariate analysis of variance to
test $H_0$ because the columns in $\mathbf{Y}$ are not independent. For a multivariate approach
that allows for correlated variables, we first express $H_0$ as $p - 1$ comparisons.


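\[
H_0: \begin{pmatrix} \mu_1 - \mu_2 \\ \mu_2 - \mu_3 \\ \vdots \\ \mu_{p-1} - \mu_p \end{pmatrix}
= \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}
\quad \text{or} \quad
H_0: \begin{pmatrix} \mu_1 - \mu_2 \\ \mu_1 - \mu_3 \\ \vdots \\ \mu_1 - \mu_p \end{pmatrix}
= \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}.
\]

These two expressions can be written in the form $H_0: \mathbf{C}_1\boldsymbol{\mu} = \mathbf{0}$ and $H_0: \mathbf{C}_2\boldsymbol{\mu} = \mathbf{0}$,
where $\mathbf{C}_1$ and $\mathbf{C}_2$ are the $(p - 1) \times p$ matrices:

\[
\mathbf{C}_1 = \begin{pmatrix}
1 & -1 & 0 & \cdots & 0 & 0 \\
0 & 1 & -1 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
0 & 0 & 0 & \cdots & 1 & -1
\end{pmatrix},
\qquad
\mathbf{C}_2 = \begin{pmatrix}
1 & -1 & 0 & \cdots & 0 \\
1 & 0 & -1 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
1 & 0 & 0 & \cdots & -1
\end{pmatrix}.
\]
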
In fact, any $(p - 1) \times p$ matrix $\mathbf{C}$ of rank $p - 1$ such that $\mathbf{C}\mathbf{j} = \mathbf{0}$ can be used in
$H_0: \mathbf{C}\boldsymbol{\mu} = \mathbf{0}$ to produce $H_0: \mu_1 = \mu_2 = \cdots = \mu_p$. If $\mathbf{C}\mathbf{j} = \mathbf{0}$, each row $\mathbf{c}_i'$ of $\mathbf{C}$
sums to zero by (2.38). A linear combination $\mathbf{c}_i'\boldsymbol{\mu} = c_{i1}\mu_1 + c_{i2}\mu_2 + \cdots + c_{ip}\mu_p$ is
called a contrast in the $\mu$'s if the coefficients sum to zero, that is, if $\sum_j c_{ij} = 0$. The
$p - 1$ contrasts in $\mathbf{C}\boldsymbol{\mu}$ must be linearly independent in order to express $H_0: \mu_1 =
\mu_2 = \cdots = \mu_p$ as $H_0: \mathbf{C}\boldsymbol{\mu} = \mathbf{0}$. Thus rank$(\mathbf{C}) = p - 1$.
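The two requirements on a contrast matrix, rows summing to zero and rank $p - 1$, can be checked numerically. The following is a minimal sketch in Python with NumPy; the function names `successive_contrasts` and `first_vs_rest_contrasts` and the choice $p = 5$ are illustrative assumptions, not notation from the text.

```python
import numpy as np

def successive_contrasts(p):
    """Build C1: row i holds +1 in column i and -1 in column i+1 (0-based)."""
    C = np.zeros((p - 1, p))
    for i in range(p - 1):
        C[i, i] = 1.0
        C[i, i + 1] = -1.0
    return C

def first_vs_rest_contrasts(p):
    """Build C2: every row holds +1 in column 0 and -1 in one later column."""
    C = np.zeros((p - 1, p))
    C[:, 0] = 1.0
    for i in range(p - 1):
        C[i, i + 1] = -1.0
    return C

p = 5                      # illustrative number of variables
j = np.ones(p)             # the vector of 1's
C1 = successive_contrasts(p)
C2 = first_vs_rest_contrasts(p)

# Each row sums to zero, so C j = 0 for both matrices.
row_sums_1 = C1 @ j
row_sums_2 = C2 @ j

# Both matrices have full row rank p - 1.
rank1 = np.linalg.matrix_rank(C1)
rank2 = np.linalg.matrix_rank(C2)

# A flat mean vector (all means equal) is mapped to the zero vector.
flat_mu = 3.0 * j
image_of_flat = C1 @ flat_mu
```

Either matrix expresses the same hypothesis; the choice affects only which differences are easiest to read off.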

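From a sample $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n$, we obtain estimates $\bar{\mathbf{y}}$ and $\mathbf{S}$ of population parameters
$\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$. To test $H_0: \mathbf{C}\boldsymbol{\mu} = \mathbf{0}$, we transform each $\mathbf{y}_i$, $i = 1, 2, \ldots, n$, to $\mathbf{z}_i =
\mathbf{C}\mathbf{y}_i$, which is $(p - 1) \times 1$. By (3.62) and (3.64), the sample mean vector and covariance
matrix of $\mathbf{z}_i = \mathbf{C}\mathbf{y}_i$, $i = 1, 2, \ldots, n$, are $\bar{\mathbf{z}} = \mathbf{C}\bar{\mathbf{y}}$ and $\mathbf{S}_z = \mathbf{C}\mathbf{S}\mathbf{C}'$, respectively.
If $\mathbf{y}$ is $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, then by property 1b in Section 4.2, $\mathbf{z} = \mathbf{C}\mathbf{y}$ is $N_{p-1}(\mathbf{C}\boldsymbol{\mu}, \mathbf{C}\boldsymbol{\Sigma}\mathbf{C}')$.
Thus when $H_0: \mathbf{C}\boldsymbol{\mu} = \mathbf{0}$ is true, $\mathbf{C}\bar{\mathbf{y}}$ is $N_{p-1}(\mathbf{0}, \mathbf{C}\boldsymbol{\Sigma}\mathbf{C}'/n)$, and

\[ T^2 = (\mathbf{C}\bar{\mathbf{y}})' \left( \frac{\mathbf{C}\mathbf{S}\mathbf{C}'}{n} \right)^{-1} (\mathbf{C}\bar{\mathbf{y}}) = n(\mathbf{C}\bar{\mathbf{y}})'(\mathbf{C}\mathbf{S}\mathbf{C}')^{-1}(\mathbf{C}\bar{\mathbf{y}}) \tag{5.33} \]

is distributed as $T^2_{p-1,\,n-1}$. We reject $H_0: \mathbf{C}\boldsymbol{\mu} = \mathbf{0}$ if $T^2 > T^2_{\alpha,\,p-1,\,n-1}$. The dimension
$p - 1$ corresponds to the number of rows of $\mathbf{C}$. Thus $\bar{\mathbf{z}} = \mathbf{C}\bar{\mathbf{y}}$ is $(p - 1) \times 1$
and $\mathbf{S}_z = \mathbf{C}\mathbf{S}\mathbf{C}'$ is $(p - 1) \times (p - 1)$. Note that the $\mathbf{C}$'s in (5.33) don't "cancel"
because $\mathbf{C}$ is $(p - 1) \times p$ and does not have an inverse. In fact, $T^2$ in (5.33) is less
than $T^2 = n\bar{\mathbf{y}}'\mathbf{S}^{-1}\bar{\mathbf{y}}$ [see Rencher (1998, p. 84)].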
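The statistic in (5.33) is straightforward to compute once $\bar{\mathbf{y}}$ and $\mathbf{S}$ are available. The sketch below, in Python with NumPy, uses simulated data (an assumption made purely for illustration; the helper name `one_sample_profile_T2` is likewise hypothetical). It also illustrates two facts: the value of $T^2$ is the same for the two contrast matrices $\mathbf{C}_1$ and $\mathbf{C}_2$, since both have rank $p - 1$ and satisfy $\mathbf{C}\mathbf{j} = \mathbf{0}$, and it does not exceed $n\bar{\mathbf{y}}'\mathbf{S}^{-1}\bar{\mathbf{y}}$.

```python
import numpy as np

def one_sample_profile_T2(Y, C):
    """T^2 = n (C ybar)' (C S C')^{-1} (C ybar), as in (5.33)."""
    n = Y.shape[0]
    ybar = Y.mean(axis=0)
    S = np.cov(Y, rowvar=False)          # sample covariance, divisor n - 1
    z = C @ ybar                         # (p - 1) x 1 vector of contrasts
    return n * (z @ np.linalg.solve(C @ S @ C.T, z))

# Simulated data: n = 12 observations on p = 4 commensurate variables.
rng = np.random.default_rng(0)
n, p = 12, 4
Y = rng.normal(loc=[10.0, 11.0, 12.5, 12.0], scale=1.0, size=(n, p))

C1 = np.array([[1, -1, 0, 0],
               [0, 1, -1, 0],
               [0, 0, 1, -1]], dtype=float)
C2 = np.array([[1, -1, 0, 0],
               [1, 0, -1, 0],
               [1, 0, 0, -1]], dtype=float)

T2_C1 = one_sample_profile_T2(Y, C1)
T2_C2 = one_sample_profile_T2(Y, C2)

# Unrestricted statistic n ybar' S^{-1} ybar, for comparison.
ybar = Y.mean(axis=0)
S = np.cov(Y, rowvar=False)
T2_full = n * (ybar @ np.linalg.solve(S, ybar))

print(T2_C1, T2_C2, T2_full)
```

Solving the linear system with `np.linalg.solve` avoids forming $(\mathbf{C}\mathbf{S}\mathbf{C}')^{-1}$ explicitly, which is a common numerical preference rather than a requirement of the method.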

If  the  variables  have  a  natural  ordering,  as,  for  example,  in  the  ramus  bone  data  in 
Table  3.6,  we  could  test  for  a  linear  trend  or  polynomial  curve  in  the  means  by  suit¬ 
ably  choosing  the  rows  of  C.  This  is  discussed  in  connection  with  growth  curves  in 
Section  6.10.  Other  comparisons  of  interest  can  be  made  as  long  as  they  are  linearly 
independent. 


5.9.2  Two-Sample  Profile  Analysis 

Suppose two independent groups or samples receive the same set of $p$ tests or measurements.
If these tests are comparable, for example, all on a scale of 0 to 100, the
variables will often be commensurate.

Rather than testing the hypothesis that $\boldsymbol{\mu}_1 = \boldsymbol{\mu}_2$, we wish to be more specific in
comparing the profiles obtained by connecting the points $(j, \mu_{1j})$, $j = 1, 2, \ldots, p$,
and $(j, \mu_{2j})$, $j = 1, 2, \ldots, p$. There are three hypotheses of interest in comparing
the profiles of two samples. The first of these hypotheses addresses the question,
Are the two profiles similar in appearance, or more precisely, are they parallel? We
illustrate this hypothesis in Figure 5.4. If the two profiles are parallel, then one group
scored uniformly better than the other group on all $p$ tests.

[Figure 5.4. Comparison of two profiles under the hypothesis of parallelism. Panel (b): hypothesis is false ($\mu_{14} - \mu_{13} \ne \mu_{24} - \mu_{23}$).]

The parallelism hypothesis can be defined in terms of the slopes. The two profiles
are parallel if the two slopes for each segment are the same. If the two profiles are
parallel, the two increments for each segment are the same, and it is not necessary to
use the actual slopes to express the hypothesis. We can simply compare the increase
from one point to the next. The hypothesis can thus be expressed as $H_{01}: \mu_{1j} -
\mu_{1,j-1} = \mu_{2j} - \mu_{2,j-1}$ for $j = 2, 3, \ldots, p$, or

\[
H_{01}: \begin{pmatrix} \mu_{12} - \mu_{11} \\ \mu_{13} - \mu_{12} \\ \vdots \\ \mu_{1p} - \mu_{1,p-1} \end{pmatrix}
= \begin{pmatrix} \mu_{22} - \mu_{21} \\ \mu_{23} - \mu_{22} \\ \vdots \\ \mu_{2p} - \mu_{2,p-1} \end{pmatrix},
\]

which can be written as $H_{01}: \mathbf{C}\boldsymbol{\mu}_1 = \mathbf{C}\boldsymbol{\mu}_2$, using the contrast matrix

\[
\mathbf{C} = \begin{pmatrix}
-1 & 1 & 0 & \cdots & 0 & 0 \\
0 & -1 & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
0 & 0 & 0 & \cdots & -1 & 1
\end{pmatrix}.
\]



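From two samples, $\mathbf{y}_{11}, \mathbf{y}_{12}, \ldots, \mathbf{y}_{1n_1}$ and $\mathbf{y}_{21}, \mathbf{y}_{22}, \ldots, \mathbf{y}_{2n_2}$, we obtain $\bar{\mathbf{y}}_1$, $\bar{\mathbf{y}}_2$,
and $\mathbf{S}_{\mathrm{pl}}$ as estimates of $\boldsymbol{\mu}_1$, $\boldsymbol{\mu}_2$, and $\boldsymbol{\Sigma}$. As in the two-sample $T^2$-test, we assume
that each $\mathbf{y}_{1i}$ in the first sample is $N_p(\boldsymbol{\mu}_1, \boldsymbol{\Sigma})$, and each $\mathbf{y}_{2i}$ in the second sample is
$N_p(\boldsymbol{\mu}_2, \boldsymbol{\Sigma})$. If $\mathbf{C}$ is a $(p - 1) \times p$ contrast matrix, as before, then $\mathbf{C}\mathbf{y}_{1i}$ and $\mathbf{C}\mathbf{y}_{2i}$
are distributed as $N_{p-1}(\mathbf{C}\boldsymbol{\mu}_1, \mathbf{C}\boldsymbol{\Sigma}\mathbf{C}')$ and $N_{p-1}(\mathbf{C}\boldsymbol{\mu}_2, \mathbf{C}\boldsymbol{\Sigma}\mathbf{C}')$, respectively. Under
$H_{01}: \mathbf{C}\boldsymbol{\mu}_1 - \mathbf{C}\boldsymbol{\mu}_2 = \mathbf{0}$, the random vector $\mathbf{C}\bar{\mathbf{y}}_1 - \mathbf{C}\bar{\mathbf{y}}_2$ is $N_{p-1}[\mathbf{0}, \mathbf{C}\boldsymbol{\Sigma}\mathbf{C}'(1/n_1 +
1/n_2)]$, and

\[
T^2 = (\mathbf{C}\bar{\mathbf{y}}_1 - \mathbf{C}\bar{\mathbf{y}}_2)' \left[ \left( \frac{1}{n_1} + \frac{1}{n_2} \right) \mathbf{C}\mathbf{S}_{\mathrm{pl}}\mathbf{C}' \right]^{-1} (\mathbf{C}\bar{\mathbf{y}}_1 - \mathbf{C}\bar{\mathbf{y}}_2)
= \frac{n_1 n_2}{n_1 + n_2} (\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)'\mathbf{C}'[\mathbf{C}\mathbf{S}_{\mathrm{pl}}\mathbf{C}']^{-1}\mathbf{C}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2) \tag{5.34}
\]

is distributed as $T^2_{p-1,\,n_1+n_2-2}$. Note that the dimension $p - 1$ is the number of rows
of $\mathbf{C}$.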
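As an illustration of (5.34), the following Python/NumPy sketch computes the parallelism statistic for two simulated samples. The data are simulated and the helper names `pooled_cov` and `parallelism_T2` are hypothetical; the pooled covariance matrix is assumed to be the usual degrees-of-freedom weighted average of the two sample covariance matrices. The sketch also shows that shifting every variable in one group by the same constant leaves $T^2$ unchanged, which is what one expects of a test that looks only at the shape of the profiles.

```python
import numpy as np

def pooled_cov(Y1, Y2):
    """Pooled covariance, assumed here to be the df-weighted average of S1 and S2."""
    n1, n2 = Y1.shape[0], Y2.shape[0]
    S1 = np.cov(Y1, rowvar=False)
    S2 = np.cov(Y2, rowvar=False)
    return ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)

def parallelism_T2(Y1, Y2, C):
    """T^2 of (5.34) for H01: C mu1 = C mu2."""
    n1, n2 = Y1.shape[0], Y2.shape[0]
    d = C @ (Y1.mean(axis=0) - Y2.mean(axis=0))   # differences in increments
    CSC = C @ pooled_cov(Y1, Y2) @ C.T
    return (n1 * n2 / (n1 + n2)) * (d @ np.linalg.solve(CSC, d))

# Simulated samples of sizes 15 and 18 on p = 4 variables.
rng = np.random.default_rng(1)
p = 4
C = np.array([[-1, 1, 0, 0],
              [0, -1, 1, 0],
              [0, 0, -1, 1]], dtype=float)
Y1 = rng.normal(size=(15, p))
Y2 = rng.normal(size=(18, p))

T2 = parallelism_T2(Y1, Y2, C)

# Raising the whole second profile by a constant does not affect parallelism.
T2_shifted = parallelism_T2(Y1, Y2 + 5.0, C)

print(T2, T2_shifted)
```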
By analogy with the discussion in Section 5.5, if $H_{01}$ is rejected, we can follow
up with univariate tests on the individual components of $\mathbf{C}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)$. Alternatively,
we can calculate the discriminant function

\[ \mathbf{a} = (\mathbf{C}\mathbf{S}_{\mathrm{pl}}\mathbf{C}')^{-1}\mathbf{C}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2) \tag{5.35} \]

as an indication of which slope differences contributed most to rejection of $H_{01}$ in
the presence of the other components of $\mathbf{C}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)$. There should be less need in
this case to standardize the components of $\mathbf{a}$, as suggested in Section 5.5, because the
variables are assumed to be commensurate. The vector $\mathbf{a}$ is $(p - 1) \times 1$, corresponding
to the $p - 1$ segments of the profile. Thus if the second component of $\mathbf{a}$, for example,
is largest in absolute value, the divergence in slopes between the two profiles on the
second segment contributes most to rejection of $H_{01}$.

If the data are arranged as in Table 5.4, we see an analogy to a two-way ANOVA
model. A plot of the means is often made in a two-way ANOVA; a lack of parallelism
corresponds to interaction between the two factors. Thus the hypothesis $H_{01}$
is analogous to the group by test (variable) interaction hypothesis.

However, the usual ANOVA assumption of independence of observations does
not hold here because the variables (tests) are correlated. The ANOVA assumption
of independence and homogeneity of variances would require cov$(\mathbf{y}) = \boldsymbol{\Sigma} = \sigma^2\mathbf{I}$.
Hence the test of $H_{01}$ cannot be carried out using a univariate ANOVA approach,
since $\boldsymbol{\Sigma} \ne \sigma^2\mathbf{I}$. We therefore proceed with the multivariate approach using $T^2$.

The second hypothesis of interest in comparing two profiles is, Are the two populations
or groups at the same level? This hypothesis corresponds to a group (population)
main effect in the ANOVA analogy. We can express this hypothesis in terms
of the average level of group 1 compared to the average level of group 2:

\[ H_{02}: \frac{\mu_{11} + \mu_{12} + \cdots + \mu_{1p}}{p} = \frac{\mu_{21} + \mu_{22} + \cdots + \mu_{2p}}{p}. \]


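Table 5.4. Data Layout for Two-Sample Profile Analysis

                                  Tests (variables)
                              1           2          ...   p
Group 1   y'_{11}    =  ( y_{111}      y_{112}      ...   y_{11p}    )
          y'_{12}    =  ( y_{121}      y_{122}      ...   y_{12p}    )
          ...
          y'_{1n_1}  =  ( y_{1n_1 1}   y_{1n_1 2}   ...   y_{1n_1 p} )
Group 2   y'_{21}    =  ( y_{211}      y_{212}      ...   y_{21p}    )
          y'_{22}    =  ( y_{221}      y_{222}      ...   y_{22p}    )
          ...
          y'_{2n_2}  =  ( y_{2n_2 1}   y_{2n_2 2}   ...   y_{2n_2 p} )
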
[Figure 5.5. Hypothesis $H_{02}$ of equal group effect, assuming parallelism.]

By (2.37), this can be expressed as

\[ H_{02}: \mathbf{j}'\boldsymbol{\mu}_1 = \mathbf{j}'\boldsymbol{\mu}_2. \]

If $H_{01}$ is true, $H_{02}$ can be pictured as in Figure 5.5a. If $H_{02}$ is false, then the two
profiles differ by a constant (given that $H_{01}$ is true), as in Figure 5.5b.

The hypothesis $H_{02}$ can be true when $H_{01}$ does not hold. Thus the average level
of  population  1  can  equal  the  average  level  of  population  2  without  the  two  profiles 
being  parallel,  as  illustrated  in  Figure  5.6.  In  this  case,  the  “group  main  effect”  is 
somewhat  harder  to  interpret,  as  is  the  case  in  the  analogous  two-way  ANOVA,  where 
main  effects  are  more  difficult  to  describe  in  the  presence  of  significant  interaction. 
However,  the  test  may  still  furnish  useful  information  if  a  careful  description  of  the 
results  is  provided. 


Figure  5.6.  Hypothesis  H02  of  equal  group  effect  without  parallelism. 
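To test $H_{02}: \mathbf{j}'(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) = 0$, we estimate $\mathbf{j}'(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$ by $\mathbf{j}'(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)$, which is
$N[0, \mathbf{j}'\boldsymbol{\Sigma}\mathbf{j}(1/n_1 + 1/n_2)]$ when $H_{02}$ is true. We can therefore use

\[ t = \frac{\mathbf{j}'(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)}{\sqrt{\mathbf{j}'\mathbf{S}_{\mathrm{pl}}\mathbf{j}\,(1/n_1 + 1/n_2)}} \tag{5.36} \]

and reject $H_{02}$ if $|t| > t_{\alpha/2,\,n_1+n_2-2}$.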
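Because $\mathbf{j}'\mathbf{y}$ is simply the sum of the $p$ scores for a subject, the statistic in (5.36) coincides with an ordinary pooled two-sample $t$-statistic computed on those sums. The Python/NumPy sketch below checks this on simulated data; the data, the helper names `pooled_cov` and `levels_t`, and the df-weighted form of the pooled covariance are assumptions made for illustration.

```python
import numpy as np

def pooled_cov(Y1, Y2):
    """Pooled covariance, assumed to be the df-weighted average of S1 and S2."""
    n1, n2 = Y1.shape[0], Y2.shape[0]
    S1 = np.cov(Y1, rowvar=False)
    S2 = np.cov(Y2, rowvar=False)
    return ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)

def levels_t(Y1, Y2):
    """t-statistic of (5.36) for H02: j' mu1 = j' mu2."""
    n1, n2 = Y1.shape[0], Y2.shape[0]
    j = np.ones(Y1.shape[1])
    num = j @ (Y1.mean(axis=0) - Y2.mean(axis=0))
    den = np.sqrt((j @ pooled_cov(Y1, Y2) @ j) * (1.0 / n1 + 1.0 / n2))
    return num / den

# Simulated samples on p = 4 variables; group 1 is shifted upward.
rng = np.random.default_rng(2)
Y1 = rng.normal(loc=1.0, size=(10, 4))
Y2 = rng.normal(loc=0.0, size=(14, 4))
t_profile = levels_t(Y1, Y2)

# Reference: pooled two-sample t computed directly on the row sums.
s1, s2 = Y1.sum(axis=1), Y2.sum(axis=1)
n1, n2 = len(s1), len(s2)
sp2 = ((n1 - 1) * s1.var(ddof=1) + (n2 - 1) * s2.var(ddof=1)) / (n1 + n2 - 2)
t_sums = (s1.mean() - s2.mean()) / np.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))

print(t_profile, t_sums)
```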

The third hypothesis of interest, corresponding to the test (or variable) main effect,
is, Are the profiles flat? Assuming parallelism (assuming $H_{01}$ is true), the "flatness"
hypothesis can be pictured as in Figure 5.7. If $H_{01}$ is not true, the test could be carried
out separately for each group using the test in Section 5.9.1. If $H_{02}$ is true, the two
profiles in Figure 5.7a and Figure 5.7b will be coincident.

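To express the third hypothesis in a form suitable for testing, we note from Figure 5.7a
that the average of the two group means is the same for each test:

\[ H_{03}: \tfrac{1}{2}(\mu_{11} + \mu_{21}) = \tfrac{1}{2}(\mu_{12} + \mu_{22}) = \cdots = \tfrac{1}{2}(\mu_{1p} + \mu_{2p}) \tag{5.37} \]

or

\[ H_{03}: \tfrac{1}{2}\mathbf{C}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2) = \mathbf{0}, \tag{5.38} \]

where $\mathbf{C}$ is a $(p - 1) \times p$ matrix such that $\mathbf{C}\mathbf{j} = \mathbf{0}$. From Figure 5.7a, we see that $H_{03}$
could also be expressed as $\mu_{11} = \mu_{12} = \cdots = \mu_{1p}$ and $\mu_{21} = \mu_{22} = \cdots = \mu_{2p}$, or

\[ H_{03}: \mathbf{C}\boldsymbol{\mu}_1 = \mathbf{0} \quad \text{and} \quad \mathbf{C}\boldsymbol{\mu}_2 = \mathbf{0}. \]

To estimate $\tfrac{1}{2}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2)$, we use the sample grand mean vector based on a
weighted average:

\[ \bar{\mathbf{y}} = \frac{n_1\bar{\mathbf{y}}_1 + n_2\bar{\mathbf{y}}_2}{n_1 + n_2}. \]

[Figure 5.7. Hypothesis $H_{03}$ of equal tests (variables) assuming parallelism. Panel (a): hypothesis is true.]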
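It can easily be shown that under $H_{03}$ (and $H_{01}$), $E(\mathbf{C}\bar{\mathbf{y}}) = \mathbf{0}$ and cov$(\bar{\mathbf{y}}) = \boldsymbol{\Sigma}/(n_1 +
n_2)$. Therefore, $\mathbf{C}\bar{\mathbf{y}}$ is $N_{p-1}[\mathbf{0}, \mathbf{C}\boldsymbol{\Sigma}\mathbf{C}'/(n_1 + n_2)]$, and

\[
T^2 = (\mathbf{C}\bar{\mathbf{y}})' \left( \frac{\mathbf{C}\mathbf{S}_{\mathrm{pl}}\mathbf{C}'}{n_1 + n_2} \right)^{-1} (\mathbf{C}\bar{\mathbf{y}})
= (n_1 + n_2)(\mathbf{C}\bar{\mathbf{y}})'(\mathbf{C}\mathbf{S}_{\mathrm{pl}}\mathbf{C}')^{-1}\mathbf{C}\bar{\mathbf{y}} \tag{5.39}
\]

is distributed as $T^2_{p-1,\,n_1+n_2-2}$ when both $H_{01}$ and $H_{03}$ are true. It can be readily
shown that $H_{03}$ is unaffected by a difference in the profile levels (unaffected by the
status of $H_{02}$).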
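The remark that the flatness test is unaffected by a difference in levels can be seen numerically: adding a constant to every variable in one group changes $\bar{\mathbf{y}}$ only by a multiple of $\mathbf{j}$, which $\mathbf{C}$ annihilates, and leaves $\mathbf{S}_{\mathrm{pl}}$ unchanged. A minimal Python/NumPy sketch follows, using simulated data and the hypothetical helper names `pooled_cov` and `flatness_T2`; the df-weighted pooled covariance is an assumption.

```python
import numpy as np

def pooled_cov(Y1, Y2):
    """Pooled covariance, assumed to be the df-weighted average of S1 and S2."""
    n1, n2 = Y1.shape[0], Y2.shape[0]
    S1 = np.cov(Y1, rowvar=False)
    S2 = np.cov(Y2, rowvar=False)
    return ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)

def flatness_T2(Y1, Y2, C):
    """T^2 of (5.39), based on the weighted grand mean vector."""
    n1, n2 = Y1.shape[0], Y2.shape[0]
    ybar = (n1 * Y1.mean(axis=0) + n2 * Y2.mean(axis=0)) / (n1 + n2)
    z = C @ ybar
    CSC = C @ pooled_cov(Y1, Y2) @ C.T
    return (n1 + n2) * (z @ np.linalg.solve(CSC, z))

# Simulated samples on p = 4 variables with a non-flat common profile.
rng = np.random.default_rng(3)
profile = np.array([10.0, 12.0, 11.0, 14.0])
Y1 = profile + rng.normal(size=(16, 4))
Y2 = profile + rng.normal(size=(20, 4))
C = np.array([[1, -1, 0, 0],
              [0, 1, -1, 0],
              [0, 0, 1, -1]], dtype=float)

T2_flat = flatness_T2(Y1, Y2, C)

# A level shift in group 2 (a change in the status of H02) leaves T^2 unchanged.
T2_flat_shifted = flatness_T2(Y1, Y2 + 3.0, C)

print(T2_flat, T2_flat_shifted)
```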


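Example 5.9.2. We use the psychological data in Table 5.1 to illustrate two-sample
profile analysis. The values of $\bar{\mathbf{y}}_1$, $\bar{\mathbf{y}}_2$, and $\mathbf{S}_{\mathrm{pl}}$ are given in Example 5.4.2. The profiles
of the two mean vectors $\bar{\mathbf{y}}_1$ and $\bar{\mathbf{y}}_2$ are plotted in Figure 5.8. There appears to be a
lack of parallelism.

[Figure 5.8. Profiles for the psychological data in Table 5.1. Horizontal axis: variable.]

To test for parallelism, $H_{01}: \mathbf{C}\boldsymbol{\mu}_1 = \mathbf{C}\boldsymbol{\mu}_2$, we use the matrix

\[
\mathbf{C} = \begin{pmatrix} -1 & 1 & 0 & 0 \\ 0 & -1 & 1 & 0 \\ 0 & 0 & -1 & 1 \end{pmatrix}
\]

and obtain

\[
\mathbf{C}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2) = \begin{pmatrix} -1.62 \\ 8.53 \\ -9.72 \end{pmatrix},
\qquad
\mathbf{C}\mathbf{S}_{\mathrm{pl}}\mathbf{C}' = \begin{pmatrix} 10.96 & -7.05 & -1.64 \\ -7.05 & 27.26 & -12.74 \\ -1.64 & -12.74 & 23.72 \end{pmatrix}.
\]

Then, by (5.34),

\[ T^2 = \frac{(32)(32)}{32 + 32} (\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)'\mathbf{C}'(\mathbf{C}\mathbf{S}_{\mathrm{pl}}\mathbf{C}')^{-1}\mathbf{C}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2) = 74.240. \]

Upon comparison of this value with $T^2_{.01,3,62} = 12.796$ (obtained by interpolation in
Table A.7), we reject the hypothesis of parallelism.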

In Figure 5.8 the lack of parallelism is most notable in the second and third segments.
This can also be seen in the relatively large values of the second and third
components of

\[ \mathbf{C}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2) = \begin{pmatrix} -1.62 \\ 8.53 \\ -9.72 \end{pmatrix}. \]

To see which of these made the greatest statistical contribution, we can examine the
discriminant function coefficient vector given in (5.35) as

\[ \mathbf{a} = (\mathbf{C}\mathbf{S}_{\mathrm{pl}}\mathbf{C}')^{-1}\mathbf{C}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2) = \begin{pmatrix} -.136 \\ .104 \\ -.363 \end{pmatrix}. \]

Thus the third segment contributed most to rejection in the presence of the other two
segments.
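The parallelism results can be approximately reproduced from the two-decimal values of $\mathbf{C}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)$ and $\mathbf{C}\mathbf{S}_{\mathrm{pl}}\mathbf{C}'$ printed in this example. The Python/NumPy sketch below does so; because the inputs are rounded, the computed $T^2$ and $\mathbf{a}$ should be expected to agree with 74.240 and $(-.136, .104, -.363)'$ only to about two significant figures, and the variable names are illustrative.

```python
import numpy as np

n1 = n2 = 32   # group sizes used in this example

# C(ybar1 - ybar2) and C S_pl C' as printed (rounded to two decimals).
d = np.array([-1.62, 8.53, -9.72])
CSC = np.array([[10.96, -7.05, -1.64],
                [-7.05, 27.26, -12.74],
                [-1.64, -12.74, 23.72]])

# Discriminant function coefficients, as in (5.35).
a = np.linalg.solve(CSC, d)

# Parallelism statistic, as in (5.34).
T2 = (n1 * n2 / (n1 + n2)) * (d @ a)

# Segment (1-based) whose coefficient is largest in absolute value.
largest_segment = int(np.argmax(np.abs(a))) + 1

print(a, T2, largest_segment)
```

The small discrepancy from the printed values reflects rounding of the inputs, not a different method.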

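To test for equal levels, $H_{02}: \mathbf{j}'\boldsymbol{\mu}_1 = \mathbf{j}'\boldsymbol{\mu}_2$, we use (5.36),

\[
t = \frac{\mathbf{j}'(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)}{\sqrt{\mathbf{j}'\mathbf{S}_{\mathrm{pl}}\mathbf{j}\,(1/n_1 + 1/n_2)}}
= \frac{16.969}{\sqrt{(164.276)(1/32 + 1/32)}} = 5.2957.
\]

Comparing this with $t_{.005,62} = 2.658$, we reject the hypothesis of equal levels.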
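The arithmetic for the levels test can be checked directly from the two quantities quoted above, $\mathbf{j}'(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2) = 16.969$ and $\mathbf{j}'\mathbf{S}_{\mathrm{pl}}\mathbf{j} = 164.276$. A short Python sketch (variable names are illustrative):

```python
import math

n1 = n2 = 32
numerator = 16.969        # j'(ybar1 - ybar2), as printed
j_Spl_j = 164.276         # j' S_pl j, as printed

# t-statistic of (5.36).
t = numerator / math.sqrt(j_Spl_j * (1.0 / n1 + 1.0 / n2))

# Compare with the critical value t_{.005,62} quoted in the example.
reject_equal_levels = abs(t) > 2.658

print(round(t, 4), reject_equal_levels)
```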
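To test the flatness hypothesis, $H_{03}: \tfrac{1}{2}\mathbf{C}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2) = \mathbf{0}$, we first calculate

\[
\bar{\mathbf{y}} = \frac{32\bar{\mathbf{y}}_1 + 32\bar{\mathbf{y}}_2}{32 + 32} = \frac{\bar{\mathbf{y}}_1 + \bar{\mathbf{y}}_2}{2}
= \begin{pmatrix} 14.16 \\ 14.91 \\ 21.92 \\ 22.34 \end{pmatrix}.
\]

Using

\[
\mathbf{C} = \begin{pmatrix} 1 & -1 & 0 & 0 \\ 0 & 1 & -1 & 0 \\ 0 & 0 & 1 & -1 \end{pmatrix},
\]

we obtain, by (5.39),

\[ T^2 = (32 + 32)(\mathbf{C}\bar{\mathbf{y}})'(\mathbf{C}\mathbf{S}_{\mathrm{pl}}\mathbf{C}')^{-1}\mathbf{C}\bar{\mathbf{y}} = 254.004, \]

which exceeds $T^2_{.01,3,62} = 12.796$, so we reject the hypothesis of flatness. However,
since the parallelism hypothesis was rejected, a more appropriate approach would be
to test each of the two groups separately for flatness using the test of Section 5.9.1.
By (5.33), we obtain

\[ T^2 = n_1(\mathbf{C}\bar{\mathbf{y}}_1)'(\mathbf{C}\mathbf{S}_1\mathbf{C}')^{-1}(\mathbf{C}\bar{\mathbf{y}}_1) = 221.126, \]

\[ T^2 = n_2(\mathbf{C}\bar{\mathbf{y}}_2)'(\mathbf{C}\mathbf{S}_2\mathbf{C}')^{-1}(\mathbf{C}\bar{\mathbf{y}}_2) = 103.483. \]

Both of these exceed $T^2_{.01,3,31} = 14.626$, and we have significant lack of flatness.

□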


PROBLEMS 

5.1  Show that the characteristic form of $T^2$ in (5.6) is the same as the original form
in (5.5).

5.2  Show that the $T^2$-statistic in (5.9) can be expressed in the characteristic form
given in (5.10).

5.3  Show that $t^2(\mathbf{a}) = T^2$, where $t(\mathbf{a})$ is given by (5.13), $T^2$ is given by (5.9), and
$\mathbf{a} = \mathbf{S}_{\mathrm{pl}}^{-1}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)$ as in (5.14).

5.4  Show that the paired observation $t$-test in (5.22), $t = \bar{d}/(s_d/\sqrt{n})$, has the $t_{n-1}$
distribution.

5.5  Show that $s_d^2 = \sum_{i=1}^n (d_i - \bar{d})^2/(n - 1) = s_y^2 + s_x^2 - 2s_{yx}$, as in a comparison
of (5.22) and (5.23).

5.6  Show that $T^2 = n\bar{\mathbf{d}}'\mathbf{S}_d^{-1}\bar{\mathbf{d}}$ in (5.24) has the characteristic form $T^2 = \bar{\mathbf{d}}'(\mathbf{S}_d/n)^{-1}\bar{\mathbf{d}}$.

5.7  Use (5.7) to show that $T^2(\mathbf{x}|\mathbf{y})$ in (5.29) can be converted to $F$ as in (5.30).

5.8  Show that the test statistic in (5.30) for additional information in $\mathbf{x}$ above and
beyond $\mathbf{y}$ has an $F$-distribution by solving for $R^2$ in terms of $T^2$ from (5.20)
and substituting this into (5.31).

5.9  In Section 5.9.2, show that under $H_{03}$ and $H_{01}$, $E(\mathbf{C}\bar{\mathbf{y}}) = \mathbf{0}$ and cov$(\bar{\mathbf{y}}) =
\boldsymbol{\Sigma}/(n_1 + n_2)$, where $\bar{\mathbf{y}} = (n_1\bar{\mathbf{y}}_1 + n_2\bar{\mathbf{y}}_2)/(n_1 + n_2)$ and $\boldsymbol{\Sigma}$ is the common
covariance matrix of the two populations from which $\mathbf{y}_1$ and $\mathbf{y}_2$ are sampled.



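5.10  Verify that $T^2 = (n_1 + n_2)(\mathbf{C}\bar{\mathbf{y}})'(\mathbf{C}\mathbf{S}_{\mathrm{pl}}\mathbf{C}')^{-1}\mathbf{C}\bar{\mathbf{y}}$ in (5.39) has the $T^2_{p-1,\,n_1+n_2-2}$
distribution.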

5.11  Test $H_0: \boldsymbol{\mu}' = (6, 11)$ using the data

\[ \mathbf{Y} = \begin{pmatrix} 3 & 10 \\ 6 & 12 \\ 5 & 14 \\ 10 & 9 \end{pmatrix}. \]

5.12  Use the probe word data in Table 3.5:

(a)  Test $H_0: \boldsymbol{\mu} = (30, 25, 40, 25, 30)'$.

(b)  If $H_0$ is rejected, test each variable separately, using (5.3).

5.13  For the probe word data in Table 3.5, test $H_0: \mu_1 = \mu_2 = \cdots = \mu_5$, using $T^2$
in (5.33).

5.14  Use the ramus bone data in Table 3.6:

(a)  Test $H_0: \boldsymbol{\mu} = (48, 49, 50, 51)'$.

(b)  If $H_0$ is rejected, test each variable separately, using (5.3).

5.15  For the ramus bone data in Table 3.6, test $H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4$, using $T^2$
in (5.33).

5.16  Four measurements were made on two species of flea beetles (Lubischew
1962). The variables were

$y_1$ = distance of transverse groove from posterior border of prothorax ($\mu$m),
$y_2$ = length of elytra (in .01 mm),
$y_3$ = length of second antennal joint ($\mu$m),
$y_4$ = length of third antennal joint ($\mu$m).

The data are given in Table 5.5.

(a)  Test $H_0: \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2$ using $T^2$.

(b)  If the $T^2$-test in part (a) rejects $H_0$, carry out a $t$-test on each variable, as
in (5.15).

(c)  Calculate the discriminant function coefficient vector $\mathbf{a} = \mathbf{S}_{\mathrm{pl}}^{-1}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)$.

(d)  Show that if the vector $\mathbf{a}$ found in part (c) is substituted into $t^2(\mathbf{a})$ from
(5.13), the result is the same as the value of $T^2$ found in part (a).

(e)  Obtain $T^2$ using the regression approach in Section 5.6.2.

(f)  Test the significance of each variable adjusted for the other three.

(g)  Test the significance of $y_3$ and $y_4$ adjusted for $y_1$ and $y_2$.

5.17  Carry out a profile analysis on the beetles data in Table 5.5.
Table 5.5. Four Measurements on Two Species of Flea Beetles

        Haltica oleracea                    Haltica carduorum
Experiment                            Experiment
Number      y1    y2    y3    y4      Number      y1    y2    y3    y4
   1       189   245   137   163         1       181   305   184   209
   2       192   260   132   217         2       158   237   133   188
   3       217   276   141   192         3       184   300   166   231
   4       221   299   142   213         4       171   273   162   213
   5       171   239   128   158         5       181   297   163   224
   6       192   262   147   173         6       181   308   160   223
   7       213   278   136   201         7       177   301   166   221
   8       192   255   128   185         8       198   308   141   197
   9       170   244   128   192         9       180   286   146   214
  10       201   276   146   186        10       177   299   171   192
  11       195   242   128   192        11       176   317   166   213
  12       205   263   147   192        12       192   312   166   209
  13       180   252   121   167        13       176   285   141   200
  14       192   283   138   183        14       169   287   162   214
  15       200   294   138   188        15       164   265   147   192
  16       192   277   150   177        16       181   308   157   204
  17       200   287   136   173        17       192   276   154   209
  18       181   255   146   183        18       181   278   149   235
  19       192   287   141   198        19       175   271   140   192
                                        20       197   303   170   205


5.18  Twenty engineer apprentices and 20 pilots were given six tests (Travers 1939).
The variables were

$y_1$ = intelligence,
$y_2$ = form relations,
$y_3$ = dynamometer,
$y_4$ = dotting,
$y_5$ = sensory motor coordination,
$y_6$ = perseveration.

The data are given in Table 5.6.

(a)  Test $H_0: \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2$.

(b)  If the $T^2$-test in part (a) rejects $H_0$, carry out a $t$-test for each variable, as
in (5.15).

(c)  Test each variable adjusted for the other five.

(d)  Test the significance of $y_4$, $y_5$, $y_6$ adjusted for $y_1$, $y_2$, $y_3$.
Table 5.6. Comparison of Six Tests on Engineer Apprentices and Pilots

        Engineer Apprentices                          Pilots
  y1    y2    y3    y4    y5    y6        y1    y2    y3    y4    y5    y6
 121    22    74   223    54   254       132    17    77   232    50   249
 108    30    80   175    40   300       123    32    79   192    64   315
 122    49    87   266    41   223       129    31    96   250    55   319
  77    37    66   178    80   209       131    23    67   291    48   310
 140    35    71   175    38   261       110    24    96   239    42   268
 108    37    57   241    59   245        47    22    87   231    40   217
 124    39    52   194    72   242       125    32    87   227    30   324
 130    34    89   200    85   242       129    29   102   234    58   300
 149    55    91   198    50   277       130    26   104   256    58   270
 129    38    72   162    47   268       147    47    82   240    30   322
 154    37    87   170    60   244       159    37    80   227    58   317
 145    33    88   208    51   228       135    41    83   216    39   306
 112    40    60   232    29   279       100    35    83   183    57   242
 120    39    73   159    39   233       149    37    94   227    30   240
 118    21    83   152    88   233       149    38    78   258    42   271
 141    42    80   195    36   241       153    27    89   283    66   291
 135    49    73   152    42   249       136    31    83   257    31   311
 151    37    76   223    74   268        97    36   100   252    30   225
  97    46    83   164    31   243       141    37   105   250    27   243
 109    42    82   188    57   267       164    32    76   187    30   264


5.19  Data were collected in an attempt to find a screening procedure to detect carriers
of Duchenne muscular dystrophy, a disease transmitted from female carriers
to some of their male offspring (Andrews and Herzberg 1985, pp. 223-228).
The following variables were measured on a sample of noncarriers and a sample
of carriers:

$y_1$ = age,
$y_2$ = month in which measurements are taken,
$y_3$ = creatine kinase,
$y_4$ = hemopexin,
$y_5$ = lactate dehydrogenase,
$y_6$ = pyruvate kinase.

The data are given in Table 5.7.

(a)  Test $H_0: \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2$ using $y_3$, $y_4$, $y_5$, and $y_6$.

(b)  The variables $y_3$ and $y_4$ are relatively inexpensive to measure compared to
$y_5$ and $y_6$. Do $y_5$ and $y_6$ contribute an important amount to $T^2$ above and
beyond $y_3$ and $y_4$?

(c)  The levels of $y_3$, $y_4$, $y_5$, and $y_6$ may depend on age and season, $y_1$ and $y_2$.
Do $y_1$ and $y_2$ contribute a significant amount to $T^2$ when adjusted for $y_3$,
$y_4$, $y_5$, and $y_6$?

Table 5.7. Comparison of Carriers and Noncarriers of Muscular Dystrophy

Publisher's Note: Permission to reproduce this image online was not granted by the
copyright holder. Readers are kindly asked to refer to the printed version
of this chapter.

5.20  Various aspects of economic cycles were measured for consumers' goods and
producers' goods by Tintner (1946). The variables are

$y_1$ = length of cycle,
$y_2$ = percentage of rising prices,
$y_3$ = cyclical amplitude,
$y_4$ = rate of change.

The data for several items are given in Table 5.8.

Table 5.8. Cyclical Measurements of Consumer Goods and Producer Goods

          Consumer Goods                       Producer Goods
Item     y1     y2     y3     y4      Item     y1     y2     y3     y4
  1      72     50      8     .5        1      57     57    12.5    .9
  2     66.5    48     15    1.0        2     100     54    17      .5
  3      54     57     14    1.0        3     100     32    16.5    .7
  4      67     60     15     .9        4     96.5    65    20.5    .9
  5      44     57     14     .3        5      79     51    18      .9
  6      41     52     18    1.9        6     78.5    53    18     1.2
  7     34.5    50      4     .5        7      48     50    21     1.6
  8     34.5    46     8.5   1.0        8     155     44    20.5   1.4
  9      24     54      3    1.2        9      84     64    13      .8
                                       10     105     35    17     1.8

(a)  Test $H_0: \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2$ using $T^2$.

(b)  Calculate the discriminant function coefficient vector.

(c)  Test for significance of each variable adjusted for the other three.

5.21  Each of 15 students wrote an informal and a formal essay (Kramer 1972,
p. 100). The variables recorded were the number of words and the number
of verbs:

$y_1$ = number of words in the informal essay,
$y_2$ = number of verbs in the informal essay,
$x_1$ = number of words in the formal essay,
$x_2$ = number of verbs in the formal essay.
Table 5.9. Number of Words and Number of Verbs

               Informal            Formal
             Words   Verbs     Words   Verbs
Student       y1      y2        x1      x2      d1 = y1 - x1    d2 = y2 - x2
   1         148      20       137      15          +11              +5
   2         159      24       164      25           -5              -1
   3         144      19       224      27          -80              -8
   4         103      18       208      33         -105             -15
   5         121      17       178      24          -57              -7
   6          89      11       128      20          -39              -9
   7         119      17       154      18          -35              -1
   8         123      13       158      16          -35              -3
   9          76      16       102      21          -26              -5
  10         217      29       214      25           +3              +4
  11         148      22       209      24          -61              -2
  12         151      21       151      16            0              +5
  13          83       7       123      13          -40              -6
  14         135      20       161      22          -26              -2
  15         178      15       175      23           +3              -8


Table 5.10. Survival Times for Bronchus Cancer Patients
and Matched Controls

Ascorbate Patients      Matched Controls
   y1      y2              x1      x2
   81      74              72      33
  461     423             134      18
   20      16              84      20
  450     450              98      58
  246      87              48      13
  166     115             142      49
   63      50             113      38
   64      50              90      24
  155     113              30      18
  151      38             260      34
  166     156             116      20
   37      27              87      27
  223     218              69      32
  138     138             100      27
   72      39             315      39
  245     231             188      65
The data are given in Table 5.9. Since each student wrote both types of essays,
the observation vectors are paired, and we use the paired comparison test.

(a)  Test $H_0: \boldsymbol{\mu}_d = \mathbf{0}$.

(b)  Find the discriminant function coefficient vector.

(c)  Do a univariate $t$-test on each $d_j$.

5.22  A number of patients with bronchus cancer were treated with ascorbate and
compared with matched patients who received no ascorbate (Cameron and
Pauling 1978). The data are given in Table 5.10. The variables measured were

$y_1$, $x_1$ = survival time (days) from date of first hospital admission,
$y_2$, $x_2$ = survival time from date of untreatability.

Compare $y_1$ and $y_2$ with $x_1$ and $x_2$ using a paired comparison $T^2$-test.

5.23  Use the glucose data in Table 3.8:

(a)  Test $H_0: \boldsymbol{\mu}_y = \boldsymbol{\mu}_x$ using a paired comparison test.

(b)  Test the significance of each variable adjusted for the other two.


CHAPTER  6 


Multivariate  Analysis  of  Variance 


In  this  chapter  we  extend  univariate  analysis  of  variance  to  multivariate  analysis  of 
variance,  in  which  we  measure  more  than  one  variable  on  each  experimental  unit. 
For  multivariate  analysis  of  covariance,  see  Rencher  (1998,  Section  4.10). 


6.1  ONE-WAY  MODELS 

We begin with a review of univariate analysis of variance (ANOVA) before covering
multivariate analysis of variance (MANOVA) with several dependent variables.

6.1.1  Univariate  One-Way  Analysis  of  Variance  (ANOVA) 

In  the  balanced  one-way  ANOVA,  we  have  a  random  sample  of  n  observations  from 
each  of  k  normal  populations  with  equal  variances,  as  in  the  following  layout: 


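\[
\begin{array}{lcccc}
 & \text{Sample 1} & \text{Sample 2} & \cdots & \text{Sample } k \\
 & \text{from } N(\mu_1, \sigma^2) & \text{from } N(\mu_2, \sigma^2) & \cdots & \text{from } N(\mu_k, \sigma^2) \\
\hline
 & y_{11} & y_{21} & \cdots & y_{k1} \\
 & y_{12} & y_{22} & \cdots & y_{k2} \\
 & \vdots & \vdots & & \vdots \\
 & y_{1n} & y_{2n} & \cdots & y_{kn} \\
\hline
\text{Total} & y_{1.} & y_{2.} & \cdots & y_{k.} \\
\text{Mean} & \bar{y}_{1.} & \bar{y}_{2.} & \cdots & \bar{y}_{k.} \\
\text{Variance} & s_1^2 & s_2^2 & \cdots & s_k^2
\end{array}
\]

The $k$ samples or the populations from which they arise are sometimes referred to
as groups. The groups may correspond to treatments applied by the researcher in an
experiment. We have used the "dot" notation for totals and means for each group:

\[ y_{i.} = \sum_{j=1}^n y_{ij}, \qquad \bar{y}_{i.} = \sum_{j=1}^n \frac{y_{ij}}{n}. \tag{6.1} \]
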
The $k$ samples are assumed to be independent. The assumptions of independence and
common variance are necessary to obtain an $F$-test.

The model for each observation is

\[
y_{ij} = \mu + \alpha_i + \varepsilon_{ij}
= \mu_i + \varepsilon_{ij}, \quad i = 1, 2, \ldots, k; \ j = 1, 2, \ldots, n, \tag{6.2}
\]

where $\mu_i = \mu + \alpha_i$ is the mean of the $i$th population. We wish to compare the sample
means $\bar{y}_{i.}$, $i = 1, 2, \ldots, k$, to see if they are sufficiently different to lead us to believe
the population means differ. The hypothesis can be expressed as $H_0: \mu_1 = \mu_2 =
\cdots = \mu_k$. Note that the notation for subscripts differs from that of previous chapters,
in which the subscript $i$ represented the observation. In this chapter, we use the last
subscript in a model such as (6.2) to represent the observation.

If the hypothesis is true, all $y_{ij}$ are from the same population, $N(\mu, \sigma^2)$, and we
can obtain two estimates of $\sigma^2$, one based on the sample variances $s_1^2, s_2^2, \ldots, s_k^2$
[see (3.4) and (3.5)] and the other based on the sample means $\bar{y}_{1.}, \bar{y}_{2.}, \ldots, \bar{y}_{k.}$. The
pooled "within-sample" estimator of $\sigma^2$ is

\[
s_e^2 = \frac{1}{k} \sum_{i=1}^k s_i^2
= \frac{\sum_{i=1}^k \sum_{j=1}^n (y_{ij} - \bar{y}_{i.})^2}{k(n - 1)}
= \frac{\sum_{ij} y_{ij}^2 - \sum_i y_{i.}^2/n}{k(n - 1)}. \tag{6.3}
\]

Our second estimate of $\sigma^2$ (under $H_0$) is based on the variance of the sample
means,

\[ s_{\bar{y}}^2 = \frac{\sum_{i=1}^k (\bar{y}_{i.} - \bar{y}_{..})^2}{k - 1}, \tag{6.4} \]

where $\bar{y}_{..} = \sum_{i=1}^k \bar{y}_{i.}/k$ is the overall mean. If $H_0$ is true, $s_{\bar{y}}^2$ estimates $\sigma_{\bar{y}}^2 = \sigma^2/n$
[see remarks following (3.1) in Section 3.1], and therefore $E(ns_{\bar{y}}^2) = n(\sigma^2/n) = \sigma^2$,
from which the estimate of $\sigma^2$ is

\[
ns_{\bar{y}}^2 = \frac{n \sum_{i=1}^k (\bar{y}_{i.} - \bar{y}_{..})^2}{k - 1}
= \frac{\sum_i y_{i.}^2/n - y_{..}^2/kn}{k - 1}, \tag{6.5}
\]

where $y_{..} = \sum_i y_{i.} = \sum_{ij} y_{ij}$ is the overall total. If $H_0$ is false, $E(ns_{\bar{y}}^2) = \sigma^2 +
n \sum_i \alpha_i^2/(k - 1)$, and $ns_{\bar{y}}^2$ will tend to reflect a larger spread in $\bar{y}_{1.}, \bar{y}_{2.}, \ldots, \bar{y}_{k.}$.
Since $s_e^2$ is based on variability within each sample, it estimates $\sigma^2$ whether or not
$H_0$ is true; thus $E(s_e^2) = \sigma^2$ in either case.

When sampling from normal distributions, $s_e^2$, a pooled estimator based on the $k$
values of $s_i^2$, is independent of $s_{\bar{y}}^2$, which is based on the $\bar{y}_{i.}$'s. We can justify this
assertion by noting that $\bar{y}_{i.}$ and $s_i^2$ are independent in each sample (when sampling
from the normal distribution) and that the $k$ samples are independent of each other.
Since $ns_{\bar{y}}^2$ and $s_e^2$ are independent and both estimate $\sigma^2$, their ratio forms an
$F$-statistic (see Section 7.3.1):

\[
F = \frac{ns_{\bar{y}}^2}{s_e^2}
= \frac{\left( \sum_i y_{i.}^2/n - y_{..}^2/kn \right) / (k - 1)}{\left( \sum_{ij} y_{ij}^2 - \sum_i y_{i.}^2/n \right) / [k(n - 1)]} \tag{6.6}
\]

\[
= \frac{\mathrm{SSH}/(k - 1)}{\mathrm{SSE}/[k(n - 1)]} = \frac{\mathrm{MSH}}{\mathrm{MSE}}, \tag{6.7}
\]

where SSH $= \sum_i y_{i.}^2/n - y_{..}^2/kn$ and SSE $= \sum_{ij} y_{ij}^2 - \sum_i y_{i.}^2/n$ are the "between"-sample
sum of squares (due to the means) and "within"-sample sum of squares,
respectively, and MSH and MSE are the corresponding sample mean squares. The
$F$-statistic (6.6) is distributed as $F_{k-1,\,k(n-1)}$ when $H_0$ is true. We reject $H_0$ if $F > F_\alpha$.
The $F$-statistic (6.6) can be shown to be a simple function of the likelihood ratio.
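To make the two routes to $F$ concrete, the following Python/NumPy sketch evaluates (6.6) from the totals-based formulas for SSH and SSE and compares the result with $ns_{\bar{y}}^2/s_e^2$ computed from the group means and variances. The small data set and the function name `one_way_anova_F` are made up for illustration and do not come from the text.

```python
import numpy as np

def one_way_anova_F(Y):
    """Balanced one-way ANOVA; Y is k x n with row i holding sample i."""
    k, n = Y.shape
    totals = Y.sum(axis=1)                 # y_i. for each group
    grand_total = Y.sum()                  # y_..
    SSH = (totals ** 2).sum() / n - grand_total ** 2 / (k * n)
    SSE = (Y ** 2).sum() - (totals ** 2).sum() / n
    MSH = SSH / (k - 1)
    MSE = SSE / (k * (n - 1))
    return MSH / MSE, SSH, SSE

# Made-up balanced data: k = 3 groups, n = 3 observations per group.
Y = np.array([[1.0, 2.0, 3.0],
              [2.0, 3.0, 4.0],
              [6.0, 7.0, 8.0]])

F, SSH, SSE = one_way_anova_F(Y)

# Same ratio from the means and variances: n * s_ybar^2 / s_e^2.
k, n = Y.shape
group_means = Y.mean(axis=1)
group_vars = Y.var(axis=1, ddof=1)
F_from_means = n * group_means.var(ddof=1) / group_vars.mean()

print(F, SSH, SSE, F_from_means)
```

For these data the group totals are 6, 9, and 21, which gives SSH = 42, SSE = 6, and $F = 21$ by either route.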


6.1.2  Multivariate  One-Way  Analysis  of  Variance  Model  (MANOVA) 

We often measure several dependent variables on each experimental unit instead of
just one variable. In the multivariate case, we assume that $k$ independent random
samples of size $n$ are obtained from $p$-variate normal populations with equal covariance
matrices, as in the following layout for balanced one-way multivariate analysis
of variance. (In practice, the observation vectors $\mathbf{y}_{ij}$ would ordinarily be listed in
row form, and sample 2 would appear below sample 1, and so on. See, for example,
Table 6.2.)


Sample  1 

Sample  2 

Sample  k 

from  Np(f ti,  2) 

from  Np(na,  2) 

from  Np{fik,  2) 

yn 

yzi 

y« 

yi2 

yn 

yk2 

yi» 

Yin 

ykn 

Total  yL 

yi. 

y  k. 

Mean  yj 

y 2. 

7k. 

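Totals and means are defined as follows:

Total of the $i$th sample: $\mathbf{y}_{i.} = \sum_{j=1}^n \mathbf{y}_{ij}$.
Overall total: $\mathbf{y}_{..} = \sum_{i=1}^k \sum_{j=1}^n \mathbf{y}_{ij}$.
Mean of the $i$th sample: $\bar{\mathbf{y}}_{i.} = \mathbf{y}_{i.}/n$.
Overall mean: $\bar{\mathbf{y}}_{..} = \mathbf{y}_{..}/kn$.

The model for each observation vector is

$$\mathbf{y}_{ij} = \boldsymbol{\mu} + \boldsymbol{\alpha}_i + \boldsymbol{\varepsilon}_{ij} = \boldsymbol{\mu}_i + \boldsymbol{\varepsilon}_{ij}, \quad i = 1, 2, \ldots, k; \; j = 1, 2, \ldots, n. \tag{6.8}$$

In terms of the $p$ variables in $\mathbf{y}_{ij}$, (6.8) becomes

$$\begin{pmatrix} y_{ij1} \\ y_{ij2} \\ \vdots \\ y_{ijp} \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_p \end{pmatrix} + \begin{pmatrix} \alpha_{i1} \\ \alpha_{i2} \\ \vdots \\ \alpha_{ip} \end{pmatrix} + \begin{pmatrix} \varepsilon_{ij1} \\ \varepsilon_{ij2} \\ \vdots \\ \varepsilon_{ijp} \end{pmatrix} = \begin{pmatrix} \mu_{i1} \\ \mu_{i2} \\ \vdots \\ \mu_{ip} \end{pmatrix} + \begin{pmatrix} \varepsilon_{ij1} \\ \varepsilon_{ij2} \\ \vdots \\ \varepsilon_{ijp} \end{pmatrix},$$

so that the model for the $r$th variable ($r = 1, 2, \ldots, p$) in each vector $\mathbf{y}_{ij}$ is

$$y_{ijr} = \mu_r + \alpha_{ir} + \varepsilon_{ijr} = \mu_{ir} + \varepsilon_{ijr}.$$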

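We wish to compare the mean vectors of the $k$ samples for significant differences. The hypothesis is, therefore,

$$H_0\colon \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2 = \cdots = \boldsymbol{\mu}_k \quad \text{vs.} \quad H_1\colon \text{at least two } \boldsymbol{\mu}\text{'s are unequal.}$$

Equality of the mean vectors implies that the $k$ means are equal for each variable; that is, $\mu_{1r} = \mu_{2r} = \cdots = \mu_{kr}$ for $r = 1, 2, \ldots, p$. If two means differ for just one variable, for example, $\mu_{23} \neq \mu_{43}$, then $H_0$ is false and we wish to reject it. We can see this by examining the elements of the population mean vectors:

$$H_0\colon \begin{pmatrix} \mu_{11} \\ \mu_{12} \\ \vdots \\ \mu_{1p} \end{pmatrix} = \begin{pmatrix} \mu_{21} \\ \mu_{22} \\ \vdots \\ \mu_{2p} \end{pmatrix} = \cdots = \begin{pmatrix} \mu_{k1} \\ \mu_{k2} \\ \vdots \\ \mu_{kp} \end{pmatrix}.$$

Thus $H_0$ implies $p$ sets of equalities:

$$\begin{array}{c}
\mu_{11} = \mu_{21} = \cdots = \mu_{k1}, \\
\mu_{12} = \mu_{22} = \cdots = \mu_{k2}, \\
\vdots \\
\mu_{1p} = \mu_{2p} = \cdots = \mu_{kp}.
\end{array}$$

All $p(k - 1)$ equalities must hold for $H_0$ to be true; failure of only one equality will falsify the hypothesis.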

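In the univariate case, we have "between" and "within" sums of squares SSH and SSE. By (6.3), (6.5), and (6.6), these are given by

$$\text{SSH} = n\sum_{i=1}^k (\bar{y}_{i.} - \bar{y}_{..})^2 = \sum_{i=1}^k \frac{y_{i.}^2}{n} - \frac{y_{..}^2}{kn},$$

$$\text{SSE} = \sum_{i=1}^k \sum_{j=1}^n (y_{ij} - \bar{y}_{i.})^2 = \sum_{ij} y_{ij}^2 - \sum_i \frac{y_{i.}^2}{n}.$$

By analogy, in the multivariate case, we have "between" and "within" matrices $\mathbf{H}$ and $\mathbf{E}$, defined as

$$\mathbf{H} = n\sum_{i=1}^k (\bar{\mathbf{y}}_{i.} - \bar{\mathbf{y}}_{..})(\bar{\mathbf{y}}_{i.} - \bar{\mathbf{y}}_{..})', \tag{6.9}$$

$$\mathbf{E} = \sum_{i=1}^k \sum_{j=1}^n (\mathbf{y}_{ij} - \bar{\mathbf{y}}_{i.})(\mathbf{y}_{ij} - \bar{\mathbf{y}}_{i.})' = \sum_{ij} \mathbf{y}_{ij}\mathbf{y}_{ij}' - \sum_i \frac{1}{n}\mathbf{y}_{i.}\mathbf{y}_{i.}'. \tag{6.10}$$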

The $p \times p$ "hypothesis" matrix $\mathbf{H}$ has a between sum of squares on the diagonal for each of the $p$ variables. Off-diagonal elements are analogous sums of products for each pair of variables. Assuming there are no linear dependencies in the variables, the rank of $\mathbf{H}$ is the smaller of $p$ and $\nu_H$, $\min(p, \nu_H)$, where $\nu_H$ represents the degrees of freedom for hypothesis; in the one-way case $\nu_H = k - 1$. Thus $\mathbf{H}$ can be singular. The $p \times p$ "error" matrix $\mathbf{E}$ has a within sum of squares for each variable on the diagonal, with analogous sums of products off-diagonal. The rank of $\mathbf{E}$ is $p$, unless $\nu_E$ is less than $p$.

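Thus $\mathbf{H}$ has the form

$$\mathbf{H} = \begin{pmatrix} \text{SSH}_{11} & \text{SPH}_{12} & \cdots & \text{SPH}_{1p} \\ \text{SPH}_{12} & \text{SSH}_{22} & \cdots & \text{SPH}_{2p} \\ \vdots & \vdots & & \vdots \\ \text{SPH}_{1p} & \text{SPH}_{2p} & \cdots & \text{SSH}_{pp} \end{pmatrix}, \tag{6.11}$$

where, for example,

$$\text{SSH}_{22} = n\sum_{i=1}^k (\bar{y}_{i.2} - \bar{y}_{..2})^2 = \sum_i \frac{y_{i.2}^2}{n} - \frac{y_{..2}^2}{kn},$$

$$\text{SPH}_{12} = n\sum_{i=1}^k (\bar{y}_{i.1} - \bar{y}_{..1})(\bar{y}_{i.2} - \bar{y}_{..2}) = \sum_i \frac{y_{i.1}\,y_{i.2}}{n} - \frac{y_{..1}\,y_{..2}}{kn}.$$

In these expressions, the subscript 1 or 2 indicates the first or second variable. Thus, for example, $\bar{y}_{i.2}$ is the second element in $\bar{\mathbf{y}}_{i.}$:

$$\bar{\mathbf{y}}_{i.} = \begin{pmatrix} \bar{y}_{i.1} \\ \bar{y}_{i.2} \\ \vdots \\ \bar{y}_{i.p} \end{pmatrix}.$$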

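The matrix $\mathbf{E}$ can be expressed in a form similar to (6.11):

$$\mathbf{E} = \begin{pmatrix} \text{SSE}_{11} & \text{SPE}_{12} & \cdots & \text{SPE}_{1p} \\ \text{SPE}_{12} & \text{SSE}_{22} & \cdots & \text{SPE}_{2p} \\ \vdots & \vdots & & \vdots \\ \text{SPE}_{1p} & \text{SPE}_{2p} & \cdots & \text{SSE}_{pp} \end{pmatrix}, \tag{6.12}$$

where, for example,

$$\text{SSE}_{22} = \sum_{i=1}^k \sum_{j=1}^n (y_{ij2} - \bar{y}_{i.2})^2 = \sum_{ij} y_{ij2}^2 - \sum_i \frac{y_{i.2}^2}{n},$$

$$\text{SPE}_{12} = \sum_{i=1}^k \sum_{j=1}^n (y_{ij1} - \bar{y}_{i.1})(y_{ij2} - \bar{y}_{i.2}) = \sum_{ij} y_{ij1}\,y_{ij2} - \sum_i \frac{y_{i.1}\,y_{i.2}}{n}.$$

Note that the elements of $\mathbf{E}$ are sums of squares and products, not variances and covariances. To estimate $\boldsymbol{\Sigma}$, we use $\mathbf{S}_{\text{pl}} = \mathbf{E}/(nk - k)$, so that

$$E\!\left(\frac{\mathbf{E}}{nk - k}\right) = \boldsymbol{\Sigma}.$$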
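The definitions (6.9) and (6.10) translate directly into array operations. The following is a minimal sketch, assuming NumPy and balanced data held in an array of shape (k, n, p); the function name `manova_h_e` and the array layout are illustrative assumptions, not notation from the text.

```python
import numpy as np

def manova_h_e(y):
    """Between (H) and within (E) SSCP matrices for balanced data.

    y has shape (k, n, p): k samples, n observation vectors each,
    p variables per vector.
    """
    y = np.asarray(y, dtype=float)
    k, n, p = y.shape
    group_means = y.mean(axis=1)                  # ybar_i., shape (k, p)
    grand_mean = y.mean(axis=(0, 1))              # ybar_.., shape (p,)
    d = group_means - grand_mean
    H = n * d.T @ d                               # (6.9)
    resid = (y - group_means[:, None, :]).reshape(k * n, p)
    E = resid.T @ resid                           # (6.10), deviation form
    return H, E

rng = np.random.default_rng(0)
y = rng.normal(size=(3, 5, 2))                    # made-up data: k=3, n=5, p=2
H, E = manova_h_e(y)
S_pl = E / (3 * 5 - 3)                            # pooled covariance estimate E/(nk - k)
print(H, E, S_pl, sep="\n")
```

A useful sanity check is that H + E equals the total sum of squares and products matrix computed from deviations about the overall mean.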


6.1.3  Wilks’  Test  Statistic 

The likelihood ratio test of $H_0\colon \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2 = \cdots = \boldsymbol{\mu}_k$ is given by

$$\Lambda = \frac{|\mathbf{E}|}{|\mathbf{E} + \mathbf{H}|}, \tag{6.13}$$

which is known as Wilks' $\Lambda$. (It has also been called Wilks' $U$.) We reject $H_0$ if $\Lambda \le \Lambda_{\alpha,p,\nu_H,\nu_E}$. Note that rejection is for small values of $\Lambda$. Exact critical values $\Lambda_{\alpha,p,\nu_H,\nu_E}$ for Wilks' $\Lambda$ are found in Table A.9, taken from Wall (1967). The parameters in Wilks' $\Lambda$ distribution are

$p$ = number of variables (dimension),
$\nu_H$ = degrees of freedom for hypothesis,
$\nu_E$ = degrees of freedom for error.

Wilks' $\Lambda$ compares the within sum of squares and products matrix $\mathbf{E}$ to the total sum of squares and products matrix $\mathbf{E} + \mathbf{H}$. This is similar to the univariate $F$-statistic in (6.6) that compares the between sum of squares to the within sum of squares. By using determinants, the test statistic $\Lambda$ is reduced to a scalar. Thus the multivariate information in $\mathbf{E}$ and $\mathbf{H}$ about separation of mean vectors $\bar{\mathbf{y}}_{1.}, \bar{\mathbf{y}}_{2.}, \ldots, \bar{\mathbf{y}}_{k.}$ is channeled into a single scale, on which we can determine if the separation of mean vectors is significant. This is typical of multivariate tests in general.

The mean vectors occupy a space of dimension $s = \min(p, \nu_H)$, and within this space various configurations of these mean vectors are possible. This suggests the possibility that another test statistic may be more powerful than Wilks' $\Lambda$. Competing test statistics are discussed in Sections 6.1.4 and 6.1.5.

Some  of  the  properties  and  characteristics  of  Wilks’  A  are  as  follows: 


1. In order for the determinants in (6.13) to be positive, it is necessary that $\nu_E \ge p$.

2. For any MANOVA model, the degrees of freedom $\nu_H$ and $\nu_E$ are always the same as in the analogous univariate case. In the balanced one-way model, for example, $\nu_H = k - 1$ and $\nu_E = k(n - 1)$.

3. The parameters $p$ and $\nu_H$ can be interchanged; the distribution of $\Lambda_{p,\nu_H,\nu_E}$ is the same as that of $\Lambda_{\nu_H,\,p,\,\nu_E+\nu_H-p}$.

4. Wilks' $\Lambda$ in (6.13) can be expressed in terms of the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_s$ of $\mathbf{E}^{-1}\mathbf{H}$, as follows:

$$\Lambda = \prod_{i=1}^s \frac{1}{1 + \lambda_i}. \tag{6.14}$$

The number of nonzero eigenvalues of $\mathbf{E}^{-1}\mathbf{H}$ is $s = \min(p, \nu_H)$, which is the rank of $\mathbf{H}$. The matrix $\mathbf{H}\mathbf{E}^{-1}$ has the same eigenvalues as $\mathbf{E}^{-1}\mathbf{H}$ (see Section 2.11.5) and could be used in its place to obtain $\Lambda$. However, we prefer $\mathbf{E}^{-1}\mathbf{H}$ because we will use its eigenvectors later.

5. The range of $\Lambda$ is $0 \le \Lambda \le 1$, and the test based on Wilks' $\Lambda$ is an inverse test in the sense that we reject $H_0$ for small values of $\Lambda$. If the sample mean vectors were equal, we would have $\mathbf{H} = \mathbf{O}$ and $\Lambda = |\mathbf{E}|/|\mathbf{E} + \mathbf{O}| = 1$. On the other hand, as the sample mean vectors become more widely spread apart compared to the within-sample variation, $\mathbf{H}$ becomes much "larger" than $\mathbf{E}$, and $\Lambda$ approaches zero.

6. In Table A.9, the critical values decrease for increasing $p$. Thus the addition of variables will reduce the power unless the variables contribute to rejection of the hypothesis by producing a significant reduction in $\Lambda$.

7. When $\nu_H = 1$ or 2 or when $p = 1$ or 2, Wilks' $\Lambda$ transforms to an exact $F$-statistic. The transformations from $\Lambda$ to $F$ for these special cases are given in Table 6.1. The hypothesis is rejected when the transformed value of $\Lambda$ exceeds the upper $\alpha$-level percentage point of the $F$-distribution, with degrees of freedom as shown.

Table 6.1. Transformations of Wilks' $\Lambda$ to Exact Upper Tail $F$-Tests

$$\begin{array}{lll}
\text{Parameters } p, \nu_H & \text{Statistic Having } F\text{-Distribution} & \text{Degrees of Freedom} \\
\hline
\text{Any } p,\ \nu_H = 1 & \dfrac{1 - \Lambda}{\Lambda}\,\dfrac{\nu_E - p + 1}{p} & p,\ \nu_E - p + 1 \\[2ex]
\text{Any } p,\ \nu_H = 2 & \dfrac{1 - \sqrt{\Lambda}}{\sqrt{\Lambda}}\,\dfrac{\nu_E - p + 1}{p} & 2p,\ 2(\nu_E - p + 1) \\[2ex]
p = 1,\ \text{any } \nu_H & \dfrac{1 - \Lambda}{\Lambda}\,\dfrac{\nu_E}{\nu_H} & \nu_H,\ \nu_E \\[2ex]
p = 2,\ \text{any } \nu_H & \dfrac{1 - \sqrt{\Lambda}}{\sqrt{\Lambda}}\,\dfrac{\nu_E - 1}{\nu_H} & 2\nu_H,\ 2(\nu_E - 1)
\end{array}$$

8. For values of $p$ and $\nu_H$ other than those in Table 6.1, an approximate $F$-statistic is given by

$$F = \frac{1 - \Lambda^{1/t}}{\Lambda^{1/t}}\,\frac{\text{df}_2}{\text{df}_1}, \tag{6.15}$$

with $\text{df}_1$ and $\text{df}_2$ degrees of freedom, where

$$\text{df}_1 = p\nu_H, \qquad \text{df}_2 = wt - \tfrac{1}{2}(p\nu_H - 2),$$

$$w = \nu_E + \nu_H - \tfrac{1}{2}(p + \nu_H + 1), \qquad t = \sqrt{\frac{p^2\nu_H^2 - 4}{p^2 + \nu_H^2 - 5}}.$$

When $p\nu_H = 2$, $t$ is set equal to 1. The approximate $F$ in (6.15) reduces to the exact $F$-values given in Table 6.1 when either $\nu_H$ or $p$ is 1 or 2.

A (less accurate) approximate test is given by

$$\chi^2 = -\left[\nu_E - \tfrac{1}{2}(p - \nu_H + 1)\right]\ln\Lambda, \tag{6.16}$$

which has an approximate $\chi^2$-distribution with $p\nu_H$ degrees of freedom. We reject $H_0$ if $\chi^2 > \chi^2_\alpha$. This approximation is accurate to three decimal places when $p^2 + \nu_H^2 \le \tfrac{1}{3}f$, where $f = \nu_E - \tfrac{1}{2}(p - \nu_H + 1)$.

9. If the multivariate test based on $\Lambda$ rejects $H_0$, it could be followed by an $F$-test as in (6.6) on each of the $p$ individual $y$'s. We can formulate a hypothesis comparing the means across the $k$ groups for each variable, namely, $H_{0r}\colon \mu_{1r} = \mu_{2r} = \cdots = \mu_{kr}$, $r = 1, 2, \ldots, p$. It does not necessarily follow that any of the $F$-tests on the $p$ individual variables will reject the corresponding $H_{0r}$. Conversely, it is possible that one or more of the $F$'s will reject $H_{0r}$ when the $\Lambda$-test accepts $H_0$. In either case, where the multivariate test and the univariate tests disagree, we use the multivariate test result rather than the univariate results. This is similar to the relationship between $T^2$-tests and $t$-tests shown in Figure 5.2.

In the three bivariate samples plotted in Figure 6.1, we illustrate the case where $\Lambda$ rejects $H_0\colon \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2 = \boldsymbol{\mu}_3$, but the $F$'s accept both of $H_{0r}\colon \mu_{1r} = \mu_{2r} = \mu_{3r}$, $r = 1, 2$, that is, for $y_1$ and $y_2$. There is no significant separation of the three samples in either the $y_1$ or $y_2$ direction alone. Other follow-up procedures are given in Sections 6.1.4 and 6.4.

10. The Wilks' $\Lambda$-test is the likelihood ratio test. Other approaches to test construction lead to different tests. Three such tests are given in Sections 6.1.4 and 6.1.5.
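Properties 4 and 8 lend themselves to a short numerical sketch. The code below assumes NumPy; the function names `wilks_lambda` and `wilks_f_approx` are hypothetical helpers introduced only for illustration. The inputs in the usage lines are the values Λ = .154, p = 4, ν_H = 5, ν_E = 42 from the rootstock example in Section 6.1.7.

```python
import numpy as np

def wilks_lambda(E, H):
    """Wilks' Lambda as the determinant ratio |E| / |E + H| in (6.13)."""
    return np.linalg.det(E) / np.linalg.det(E + H)

def wilks_f_approx(lam, p, vh, ve):
    """Approximate F for Wilks' Lambda as in (6.15).

    Returns (F, df1, df2). When p * vh == 2, t is set to 1, as stated
    with (6.15); this also avoids a 0/0 in the formula for t.
    """
    df1 = p * vh
    if df1 == 2:
        t = 1.0
    else:
        t = np.sqrt((p**2 * vh**2 - 4) / (p**2 + vh**2 - 5))
    w = ve + vh - 0.5 * (p + vh + 1)
    df2 = w * t - 0.5 * (df1 - 2)
    root = lam ** (1.0 / t)
    return (1 - root) / root * df2 / df1, df1, df2

F, df1, df2 = wilks_f_approx(0.154, p=4, vh=5, ve=42)
print(round(F, 3), df1, round(df2, 1))
```

With p = 1 the function reproduces the exact transformation in the third row of Table 6.1, since t = 1 and df2 simplifies to ν_E.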


6.1.4  Roy’s  Test 

In the union-intersection approach, we seek the linear combination $z_{ij} = \mathbf{a}'\mathbf{y}_{ij}$ that maximizes the spread of the transformed means $\bar{z}_{i.} = \mathbf{a}'\bar{\mathbf{y}}_{i.}$ relative to the within-sample spread of points. Thus we seek the vector $\mathbf{a}$ that maximizes

$$F = \frac{n\sum_{i=1}^k (\bar{z}_{i.} - \bar{z}_{..})^2/(k - 1)}{\sum_{i=1}^k \sum_{j=1}^n (z_{ij} - \bar{z}_{i.})^2/(kn - k)}, \tag{6.17}$$

which, by analogy to $s_z^2 = \mathbf{a}'\mathbf{S}\mathbf{a}$ in (3.55), can be written as

$$F = \frac{\mathbf{a}'\mathbf{H}\mathbf{a}/(k - 1)}{\mathbf{a}'\mathbf{E}\mathbf{a}/(kn - k)}. \tag{6.18}$$

[Figure 6.1. Three samples with significant Wilks' $\Lambda$ but nonsignificant $F$'s. Axes: $y_2$ (horizontal), $y_1$ (vertical).]

This is maximized by $\mathbf{a}_1$, the eigenvector corresponding to $\lambda_1$, the largest eigenvalue of $\mathbf{E}^{-1}\mathbf{H}$ (see Section 8.4.1), and we have

$$\max_{\mathbf{a}} F = \frac{\mathbf{a}_1'\mathbf{H}\mathbf{a}_1/(k - 1)}{\mathbf{a}_1'\mathbf{E}\mathbf{a}_1/(kn - k)} = \frac{k(n - 1)}{k - 1}\,\lambda_1. \tag{6.19}$$

Since $\max_{\mathbf{a}} F$ in (6.19) is maximized over all possible linear functions, it no longer has an $F$-distribution. To test $H_0\colon \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2 = \cdots = \boldsymbol{\mu}_k$ based on $\lambda_1$, we use Roy's union-intersection test, also called Roy's largest root test. The test statistic is given by

$$\theta = \frac{\lambda_1}{1 + \lambda_1}. \tag{6.20}$$

Critical values for $\theta$ are given in Table A.10 (Pearson and Hartley 1972, Pillai 1964, 1965). We reject $H_0\colon \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2 = \cdots = \boldsymbol{\mu}_k$ if $\theta \ge \theta_{\alpha,s,m,N}$. The parameters $s$, $m$, and $N$ are defined as

$$s = \min(\nu_H, p), \qquad m = \tfrac{1}{2}(|\nu_H - p| - 1), \qquad N = \tfrac{1}{2}(\nu_E - p - 1).$$

For $s = 1$, use (6.34) and (6.37) in Section 6.1.7 to obtain an $F$-test.

The eigenvector $\mathbf{a}_1$ corresponding to $\lambda_1$ is used in the discriminant function, $z = \mathbf{a}_1'\mathbf{y}$. Since this is the function that best separates the transformed means $\bar{z}_{i.} = \mathbf{a}'\bar{\mathbf{y}}_{i.}$, $i = 1, 2, \ldots, k$ [relative to the within-sample spread, see (6.17)], the coefficients $a_{11}, a_{12}, \ldots, a_{1p}$ in the linear combination $z = \mathbf{a}_1'\mathbf{y}$ can be examined for an indication of which variables contribute most to separating the means. The discriminant function is discussed further in Sections 6.1.8 and 6.4 and in Chapter 8.

We do not have a satisfactory $F$-approximation for $\theta$ or $\lambda_1$, but an "upper bound" on $F$ that is provided in some software programs is given by

$$F = \frac{(\nu_E - d - 1)\lambda_1}{d}, \tag{6.21}$$

with degrees of freedom $d$ and $\nu_E - d - 1$, where $d = \max(p, \nu_H)$. The term upper bound indicates that the $F$ in (6.21) is greater than the "true $F$"; that is, $F > F_{d,\,\nu_E-d-1}$. Therefore, we feel safe if $H_0$ is accepted by (6.21); but if rejection of $H_0$ is indicated, the information is virtually worthless.

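Some computer programs do not provide eigenvalues of nonsymmetric matrices, such as $\mathbf{E}^{-1}\mathbf{H}$. However, the eigenvalues of $\mathbf{E}^{-1}\mathbf{H}$ are the same as the eigenvalues of the symmetric matrices $(\mathbf{E}^{1/2})^{-1}\mathbf{H}(\mathbf{E}^{1/2})^{-1}$ and $(\mathbf{U}')^{-1}\mathbf{H}\mathbf{U}^{-1}$, where $\mathbf{E}^{1/2}$ is the square root matrix of $\mathbf{E}$ given in (2.112) and $\mathbf{U}'\mathbf{U} = \mathbf{E}$ is the Cholesky factorization of $\mathbf{E}$ (Section 2.7). We demonstrate this for the Cholesky approach. We first multiply the defining relationship $(\mathbf{E}^{-1}\mathbf{H} - \lambda\mathbf{I})\mathbf{a} = \mathbf{0}$ by $\mathbf{E}$ to obtain

$$(\mathbf{H} - \lambda\mathbf{E})\mathbf{a} = \mathbf{0}. \tag{6.22}$$

Then substituting $\mathbf{E} = \mathbf{U}'\mathbf{U}$ into (6.22), multiplying by $(\mathbf{U}')^{-1}$, and inserting $\mathbf{U}^{-1}\mathbf{U} = \mathbf{I}$, we have

$$(\mathbf{H} - \lambda\mathbf{U}'\mathbf{U})\mathbf{a} = \mathbf{0},$$

$$(\mathbf{U}')^{-1}(\mathbf{H} - \lambda\mathbf{U}'\mathbf{U})\mathbf{a} = (\mathbf{U}')^{-1}\mathbf{0} = \mathbf{0},$$

$$[(\mathbf{U}')^{-1}\mathbf{H} - \lambda\mathbf{U}]\mathbf{U}^{-1}\mathbf{U}\mathbf{a} = \mathbf{0},$$

$$[(\mathbf{U}')^{-1}\mathbf{H}\mathbf{U}^{-1} - \lambda\mathbf{I}]\mathbf{U}\mathbf{a} = \mathbf{0}. \tag{6.23}$$

Thus $(\mathbf{U}')^{-1}\mathbf{H}\mathbf{U}^{-1}$ has the same eigenvalues as $\mathbf{E}^{-1}\mathbf{H}$ and has eigenvectors of the form $\mathbf{U}\mathbf{a}$, where $\mathbf{a}$ is an eigenvector of $\mathbf{E}^{-1}\mathbf{H}$. Note that $(\mathbf{U}')^{-1}\mathbf{H}\mathbf{U}^{-1}$ is positive semidefinite, and thus $\lambda_i \ge 0$ for all eigenvalues of $\mathbf{E}^{-1}\mathbf{H}$.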
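The Cholesky route in (6.22) and (6.23) can be sketched as follows, assuming NumPy. Note one convention difference: `numpy.linalg.cholesky` returns a lower-triangular factor L with E = LL', so the U of the text corresponds to L'. The function name `eig_via_cholesky` and the test matrices are made up for illustration.

```python
import numpy as np

def eig_via_cholesky(E, H):
    """Eigenvalues of E^{-1}H via the symmetric matrix (U')^{-1} H U^{-1}.

    NumPy's Cholesky factor L is lower triangular with E = L L',
    so U = L' and (U')^{-1} H U^{-1} = L^{-1} H (L')^{-1}.
    """
    L = np.linalg.cholesky(E)
    A = np.linalg.solve(L, H)            # L^{-1} H
    S = np.linalg.solve(L, A.T).T        # L^{-1} H (L')^{-1}
    S = (S + S.T) / 2                    # remove round-off asymmetry
    return np.sort(np.linalg.eigvalsh(S))[::-1]

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 3))
E = X.T @ X                              # positive definite "error" matrix
M = rng.normal(size=(2, 3))
H = M.T @ M                              # rank-2 "hypothesis" matrix
print(eig_via_cholesky(E, H))
print(np.sort(np.linalg.eigvals(np.linalg.solve(E, H)).real)[::-1])
```

The two printed vectors should agree up to round-off, and one eigenvalue should be essentially zero because H has rank 2 here.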


6.1.5  Pillai  and  Lawley-Hotelling  Tests 

There are two additional test statistics for $H_0\colon \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2 = \cdots = \boldsymbol{\mu}_k$ based on the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_s$ of $\mathbf{E}^{-1}\mathbf{H}$. The Pillai statistic is given by

$$V^{(s)} = \operatorname{tr}[(\mathbf{E} + \mathbf{H})^{-1}\mathbf{H}] = \sum_{i=1}^s \frac{\lambda_i}{1 + \lambda_i}. \tag{6.24}$$

We reject $H_0$ for $V^{(s)} \ge V^{(s)}_\alpha$. The upper percentage points, $V^{(s)}_\alpha$, are given in Table A.11 (Schuurmann, Krishnaiah, and Chattopadhyay 1975), indexed by $s$, $m$, and $N$, which are defined as in Section 6.1.4 for Roy's test. For $s = 1$, use (6.34) and (6.37) in Section 6.1.7 to obtain an $F$-test.

Pillai's test statistic in (6.24) is an extension of Roy's statistic $\theta = \lambda_1/(1 + \lambda_1)$. If the mean vectors do not lie in one dimension, the information in the additional terms $\lambda_i/(1 + \lambda_i)$, $i = 2, 3, \ldots, s$, may be helpful in rejecting $H_0$.

For parameter values not included in Table A.11, we can use an approximate $F$-statistic:

$$F_1 = \frac{(2N + s + 1)V^{(s)}}{(2m + s + 1)(s - V^{(s)})}, \tag{6.25}$$

which is approximately distributed as $F_{s(2m+s+1),\,s(2N+s+1)}$. Two alternative $F$-approximations are given by

$$F_2 = \frac{s(\nu_E - \nu_H + s)V^{(s)}}{p\nu_H(s - V^{(s)})}, \tag{6.26}$$

with $p\nu_H$ and $s(\nu_E - \nu_H + s)$ degrees of freedom, and

$$F_3 = \frac{(\nu_E - p + s)V^{(s)}}{d(s - V^{(s)})}, \tag{6.27}$$

with $sd$ and $s(\nu_E - p + s)$ degrees of freedom, where $d = \max(p, \nu_H)$. It can be shown that $F_3$ in (6.27) is the same as $F_1$ in (6.25).

The Lawley-Hotelling statistic (Lawley 1938, Hotelling 1951) is defined as

$$U^{(s)} = \operatorname{tr}(\mathbf{E}^{-1}\mathbf{H}) = \sum_{i=1}^s \lambda_i \tag{6.28}$$

and is also known as Hotelling's generalized $T^2$-statistic (see a comment at the end of Section 6.1.7). Table A.12 (Davis 1970a, b, 1980) gives upper percentage points of the test statistic

$$\frac{\nu_E}{\nu_H}\,U^{(s)}.$$

We reject $H_0$ for large values of the test statistic. Note that in Table A.12, $p \le \nu_H$ and $p \le \nu_E$. If $p > \nu_H$, use $(\nu_H, p, \nu_E + \nu_H - p)$ in place of $(p, \nu_H, \nu_E)$. (This same pattern in the parameters is found in Wilks' $\Lambda$; see property 3 in Section 6.1.3.) If $\nu_H = 1$ and $p > 1$, use the relationship $U^{(1)} = T^2/\nu_E$ [see (6.39) in Section 6.1.7]. For other values of the parameters not included in Table A.12, we can use an approximate $F$-statistic:

$$F_1 = \frac{U^{(s)}}{c}, \tag{6.29}$$

which is approximately distributed as $F_{a,b}$, where

$$a = p\nu_H, \qquad b = 4 + \frac{a + 2}{B - 1}, \qquad c = \frac{a(b - 2)}{b(\nu_E - p - 1)},$$

$$B = \frac{(\nu_E + \nu_H - p - 1)(\nu_E - 1)}{(\nu_E - p - 3)(\nu_E - p)}.$$

Alternative $F$-approximations are given by

$$F_2 = \frac{2(sN + 1)U^{(s)}}{s^2(2m + s + 1)}, \tag{6.30}$$

with $s(2m + s + 1)$ and $2(sN + 1)$ degrees of freedom, and

$$F_3 = \frac{[s(\nu_E - \nu_H - 1) + 2]U^{(s)}}{sp\nu_H}, \tag{6.31}$$

with $p\nu_H$ and $s(\nu_E - \nu_H - 1)$ degrees of freedom. If $p \le \nu_H$, then $F_3$ in (6.31) is the same as $F_2$ in (6.30).
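The approximations (6.25), (6.27), and (6.30) are simple enough to code directly. The sketch below uses plain Python; all function names are hypothetical, and the numerical inputs in the usage lines are the rootstock values V = 1.305, U = 2.921, p = 4, ν_H = 5, ν_E = 42 reported in Example 6.1.7.

```python
def smn(p, vh, ve):
    """Parameters s, m, N as defined for Roy's test in Section 6.1.4."""
    s = min(vh, p)
    m = 0.5 * (abs(vh - p) - 1)
    N = 0.5 * (ve - p - 1)
    return s, m, N

def pillai_f1(V, p, vh, ve):
    """F1 of (6.25); returns (F, df1, df2)."""
    s, m, N = smn(p, vh, ve)
    F = (2 * N + s + 1) * V / ((2 * m + s + 1) * (s - V))
    return F, s * (2 * m + s + 1), s * (2 * N + s + 1)

def pillai_f3(V, p, vh, ve):
    """F3 of (6.27); returns (F, df1, df2)."""
    s, d = min(vh, p), max(p, vh)
    F = (ve - p + s) * V / (d * (s - V))
    return F, s * d, s * (ve - p + s)

def lawley_hotelling_f2(U, p, vh, ve):
    """F2 of (6.30); returns (F, df1, df2)."""
    s, m, N = smn(p, vh, ve)
    F = 2 * (s * N + 1) * U / (s**2 * (2 * m + s + 1))
    return F, s * (2 * m + s + 1), 2 * (s * N + 1)

print(pillai_f1(1.305, 4, 5, 42))
print(pillai_f3(1.305, 4, 5, 42))
print(lawley_hotelling_f2(2.921, 4, 5, 42))
```

The first two printed tuples coincide, which illustrates the remark that F3 in (6.27) equals F1 in (6.25): 2N + s + 1 = ν_E - p + s and 2m + s + 1 = max(p, ν_H).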


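6.1.6 Unbalanced One-Way MANOVA

The balanced one-way model can easily be extended to the unbalanced case, in which there are $n_i$ observation vectors in the $i$th group. The model in (6.8) becomes

$$\mathbf{y}_{ij} = \boldsymbol{\mu} + \boldsymbol{\alpha}_i + \boldsymbol{\varepsilon}_{ij} = \boldsymbol{\mu}_i + \boldsymbol{\varepsilon}_{ij}, \quad i = 1, 2, \ldots, k; \; j = 1, 2, \ldots, n_i.$$

The mean vectors become $\bar{\mathbf{y}}_{i.} = \sum_{j=1}^{n_i} \mathbf{y}_{ij}/n_i$ and $\bar{\mathbf{y}}_{..} = \sum_{i=1}^k \sum_{j=1}^{n_i} \mathbf{y}_{ij}/N$, where $N = \sum_{i=1}^k n_i$. Similarly, the total vectors are defined as $\mathbf{y}_{i.} = \sum_{j=1}^{n_i} \mathbf{y}_{ij}$ and $\mathbf{y}_{..} = \sum_{ij} \mathbf{y}_{ij}$. The $\mathbf{H}$ and $\mathbf{E}$ matrices are calculated as

$$\mathbf{H} = \sum_{i=1}^k n_i(\bar{\mathbf{y}}_{i.} - \bar{\mathbf{y}}_{..})(\bar{\mathbf{y}}_{i.} - \bar{\mathbf{y}}_{..})' = \sum_{i=1}^k \frac{1}{n_i}\mathbf{y}_{i.}\mathbf{y}_{i.}' - \frac{1}{N}\mathbf{y}_{..}\mathbf{y}_{..}', \tag{6.32}$$

$$\mathbf{E} = \sum_{i=1}^k \sum_{j=1}^{n_i} (\mathbf{y}_{ij} - \bar{\mathbf{y}}_{i.})(\mathbf{y}_{ij} - \bar{\mathbf{y}}_{i.})' = \sum_{i=1}^k \sum_{j=1}^{n_i} \mathbf{y}_{ij}\mathbf{y}_{ij}' - \sum_{i=1}^k \frac{1}{n_i}\mathbf{y}_{i.}\mathbf{y}_{i.}'. \tag{6.33}$$

Wilks' $\Lambda$ and the other tests have the same form as in Sections 6.1.3-6.1.5, using $\mathbf{H}$ and $\mathbf{E}$ from (6.32) and (6.33). In each test we have

$$\nu_H = k - 1, \qquad \nu_E = N - k = \sum_{i=1}^k n_i - k.$$

Note that $N = \sum_i n_i$ differs from $N$ used as a parameter in Roy's and Pillai's tests in Sections 6.1.4 and 6.1.5.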
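The total-based forms on the right of (6.32) and (6.33) can be sketched as below, assuming NumPy and a list of per-group arrays with unequal numbers of rows. The function name `manova_h_e_unbalanced` and the simulated data are illustrative assumptions only.

```python
import numpy as np

def manova_h_e_unbalanced(groups):
    """H and E from totals, as in (6.32)-(6.33).

    groups is a list of arrays; the i-th has shape (n_i, p).
    Returns (H, E, vh, ve) with vh = k - 1 and ve = N - k.
    """
    groups = [np.asarray(g, dtype=float) for g in groups]
    k = len(groups)
    N = sum(len(g) for g in groups)
    totals = [g.sum(axis=0) for g in groups]              # y_i.
    grand_total = sum(totals)                             # y_..
    between = sum(np.outer(t, t) / len(g) for t, g in zip(totals, groups))
    H = between - np.outer(grand_total, grand_total) / N
    E = sum(g.T @ g for g in groups) - between
    return H, E, k - 1, N - k

rng = np.random.default_rng(1)
groups = [rng.normal(size=(n_i, 3)) for n_i in (4, 6, 5)]  # made-up data
H, E, vh, ve = manova_h_e_unbalanced(groups)
print(vh, ve)
```

For numerically delicate data the deviation forms on the left of (6.32) and (6.33) are usually the safer choice, since the total-based forms subtract large, nearly equal quantities.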


6.1.7 Summary of the Four Tests and Relationship to $T^2$

We compare the four test statistics in terms of the eigenvalues $\lambda_1 > \lambda_2 > \cdots > \lambda_s$ of $\mathbf{E}^{-1}\mathbf{H}$, where $s = \min(\nu_H, p)$:

Pillai: $V^{(s)} = \sum_{i=1}^s \dfrac{\lambda_i}{1 + \lambda_i}$,
Lawley-Hotelling: $U^{(s)} = \sum_{i=1}^s \lambda_i$,
Wilks' lambda: $\Lambda = \prod_{i=1}^s \dfrac{1}{1 + \lambda_i}$,
Roy's largest root: $\theta = \dfrac{\lambda_1}{1 + \lambda_1}$.

Note that for all four tests we must have $\nu_E \ge p$. As noted in Section 6.1.3 and elsewhere, $p$ is the number of variables, $\nu_H$ is the degrees of freedom for the hypothesis, and $\nu_E$ is the degrees of freedom for error.

Why do we use four different tests? All four are exact tests; that is, when $H_0$ is true, each test has probability $\alpha$ of rejecting $H_0$. However, the tests are not equivalent, and in a given sample they may lead to different conclusions even when $H_0$ is true; some may reject $H_0$ while others accept $H_0$. This is due to the multidimensional nature of the space in which the mean vectors $\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \ldots, \boldsymbol{\mu}_k$ lie. A comparison of power and other properties of the tests is given in Section 6.2.
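Because all four statistics are functions of the same eigenvalues, they can be computed together. The following sketch assumes NumPy; the function name `manova_statistics` and the dictionary keys are hypothetical. The eigenvalues in the usage line are the ones reported for the rootstock data in Example 6.1.7.

```python
import numpy as np

def manova_statistics(eigenvalues):
    """The four MANOVA statistics from the eigenvalues of E^{-1}H."""
    lam = np.asarray(eigenvalues, dtype=float)
    lam1 = lam.max()
    return {
        "pillai": float(np.sum(lam / (1 + lam))),       # V^(s)
        "lawley_hotelling": float(np.sum(lam)),         # U^(s)
        "wilks": float(np.prod(1 / (1 + lam))),         # Lambda
        "roy": float(lam1 / (1 + lam1)),                # theta
    }

stats = manova_statistics([1.876, 0.791, 0.229, 0.026])
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

Since the eigenvalues are rounded to three decimals, the results can differ from the values in the example in the last digit.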

When $\nu_H = 1$, then $s$ is also equal to 1, and there is only one nonzero eigenvalue $\lambda_1$. In this case, all four test statistics are functions of each other and give equivalent results. In terms of $\theta$, for example, the other three become

$$U^{(1)} = \lambda_1 = \frac{\theta}{1 - \theta}, \tag{6.34}$$

$$V^{(1)} = \theta, \tag{6.35}$$

$$\Lambda = 1 - \theta. \tag{6.36}$$

In the case of $\nu_H = 1$, all four statistics can be transformed to an exact $F$ using

$$F = \frac{\nu_E - p + 1}{p}\,U^{(1)}, \tag{6.37}$$

which is distributed as $F_{p,\,\nu_E-p+1}$.

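The equivalence of all four test statistics to Hotelling's $T^2$ when $\nu_H = 1$ was noted in Section 5.6.1. We now demonstrate the relationship $T^2 = (n_1 + n_2 - 2)U^{(1)}$ in (5.17). For $\mathbf{H}$ and $\mathbf{E}$, we use (6.32) and (6.33), which allow unequal $n_i$, since we do not require $n_1 = n_2$ in $T^2$. In this case, with only two groups, $\mathbf{H} = \sum_{i=1}^2 n_i(\bar{\mathbf{y}}_{i.} - \bar{\mathbf{y}}_{..})(\bar{\mathbf{y}}_{i.} - \bar{\mathbf{y}}_{..})'$ can be expressed as

$$\mathbf{H} = \frac{n_1 n_2}{n_1 + n_2}(\bar{\mathbf{y}}_{1.} - \bar{\mathbf{y}}_{2.})(\bar{\mathbf{y}}_{1.} - \bar{\mathbf{y}}_{2.})' = c(\bar{\mathbf{y}}_{1.} - \bar{\mathbf{y}}_{2.})(\bar{\mathbf{y}}_{1.} - \bar{\mathbf{y}}_{2.})', \tag{6.38}$$

where $c = n_1 n_2/(n_1 + n_2)$. Then by (6.34) and (6.28), $U^{(1)}$ becomes

$$\begin{aligned}
U^{(1)} = \lambda_1 &= \operatorname{tr}(\mathbf{E}^{-1}\mathbf{H}) \\
&= \operatorname{tr}[c\,\mathbf{E}^{-1}(\bar{\mathbf{y}}_{1.} - \bar{\mathbf{y}}_{2.})(\bar{\mathbf{y}}_{1.} - \bar{\mathbf{y}}_{2.})'] && \text{[by (6.38)]} \\
&= c\operatorname{tr}[(\bar{\mathbf{y}}_{1.} - \bar{\mathbf{y}}_{2.})'\mathbf{E}^{-1}(\bar{\mathbf{y}}_{1.} - \bar{\mathbf{y}}_{2.})] && \text{[by (2.97)]} \\
&= \frac{c}{n_1 + n_2 - 2}(\bar{\mathbf{y}}_{1.} - \bar{\mathbf{y}}_{2.})'\left(\frac{\mathbf{E}}{n_1 + n_2 - 2}\right)^{-1}(\bar{\mathbf{y}}_{1.} - \bar{\mathbf{y}}_{2.}) \\
&= \frac{T^2}{n_1 + n_2 - 2},
\end{aligned} \tag{6.39}$$

since $\mathbf{E}/(n_1 + n_2 - 2) = \mathbf{S}_{\text{pl}}$ (see Section 5.4.2). Equations (5.16), (5.18), and (5.19) follow immediately from this result using (6.34)-(6.36).

Because of the direct relationship in (6.39) between $U^{(1)}$ and $T^2$ for the case of two groups, the Lawley-Hotelling statistic is often called the generalized $T^2$-statistic.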
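The identity in (6.39) is easy to confirm numerically. The sketch below assumes NumPy and uses simulated two-group data with unequal sample sizes; the function name `hotelling_t2` is hypothetical, and the two-sample $T^2$ is computed in its usual form $\frac{n_1 n_2}{n_1 + n_2}(\bar{\mathbf{y}}_{1.} - \bar{\mathbf{y}}_{2.})'\mathbf{S}_{\text{pl}}^{-1}(\bar{\mathbf{y}}_{1.} - \bar{\mathbf{y}}_{2.})$.

```python
import numpy as np

def hotelling_t2(y1, y2):
    """Two-sample T^2 and U^(1) = tr(E^{-1}H) for two groups of rows."""
    y1 = np.asarray(y1, dtype=float)
    y2 = np.asarray(y2, dtype=float)
    n1, n2 = len(y1), len(y2)
    c = n1 * n2 / (n1 + n2)
    d = y1.mean(axis=0) - y2.mean(axis=0)            # ybar_1. - ybar_2.
    r1 = y1 - y1.mean(axis=0)
    r2 = y2 - y2.mean(axis=0)
    E = r1.T @ r1 + r2.T @ r2                        # within SSCP, (6.33)
    S_pl = E / (n1 + n2 - 2)
    T2 = c * d @ np.linalg.solve(S_pl, d)
    H = c * np.outer(d, d)                           # (6.38)
    U1 = np.trace(np.linalg.solve(E, H))             # (6.28) with s = 1
    return T2, U1

rng = np.random.default_rng(3)
y1 = rng.normal(size=(7, 3))
y2 = rng.normal(loc=0.5, size=(9, 3))
T2, U1 = hotelling_t2(y1, y2)
print(T2, (7 + 9 - 2) * U1)
```

The two printed numbers should agree up to round-off.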


Example 6.1.7. In a classical experiment carried out from 1918 to 1934, apple trees of different rootstocks were compared (Andrews and Herzberg 1985, pp. 357-360). The data for eight trees from each of six rootstocks are given in Table 6.2. The variables are

$y_1$ = trunk girth at 4 years (mm × 100),
$y_2$ = extension growth at 4 years (m),
$y_3$ = trunk girth at 15 years (mm × 100),
$y_4$ = weight of tree above ground at 15 years (lb × 1000).

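The matrices $\mathbf{H}$, $\mathbf{E}$, and $\mathbf{E} + \mathbf{H}$ are given by

$$\mathbf{H} = \begin{pmatrix} .074 & .537 & .332 & .208 \\ .537 & 4.200 & 2.355 & 1.637 \\ .332 & 2.355 & 6.114 & 3.781 \\ .208 & 1.637 & 3.781 & 2.493 \end{pmatrix},$$

$$\mathbf{E} = \begin{pmatrix} .320 & 1.697 & .554 & .217 \\ 1.697 & 12.143 & 4.364 & 2.110 \\ .554 & 4.364 & 4.291 & 2.482 \\ .217 & 2.110 & 2.482 & 1.723 \end{pmatrix},$$

$$\mathbf{E} + \mathbf{H} = \begin{pmatrix} .394 & 2.234 & .886 & .426 \\ 2.234 & 16.342 & 6.719 & 3.747 \\ .886 & 6.719 & 10.405 & 6.263 \\ .426 & 3.747 & 6.263 & 4.216 \end{pmatrix}.$$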

In this case, the mean vectors represent six points in four-dimensional space. We can compare the mean vectors for significant differences using Wilks' $\Lambda$ as given by (6.13):

$$\Lambda = \frac{|\mathbf{E}|}{|\mathbf{E} + \mathbf{H}|} = \frac{.6571}{4.2667} = .154.$$

In this case, the parameters of the Wilks' $\Lambda$ distribution are $p = 4$, $\nu_H = 6 - 1 = 5$, and $\nu_E = 6(8 - 1) = 42$. We reject $H_0\colon \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2 = \cdots = \boldsymbol{\mu}_6$ because

$$\Lambda = .154 < \Lambda_{.05,4,5,40} = .455.$$


Table  6.2.  Rootstock  Data 


Publisher's  Note: 

Permission  to  reproduce  this  image 
Note the use of $\nu_E = 40$ in place of $\nu_E = 42$. This is a conservative approach that allows a table value to be used without interpolation.

To obtain an approximate $F$, we first calculate

$$t = \sqrt{\frac{p^2\nu_H^2 - 4}{p^2 + \nu_H^2 - 5}} = \sqrt{\frac{4^2 5^2 - 4}{4^2 + 5^2 - 5}} = 3.3166,$$

$$w = \nu_E + \nu_H - \tfrac{1}{2}(p + \nu_H + 1) = 42 + 5 - \tfrac{1}{2}(4 + 5 + 1) = 42,$$

$$\text{df}_1 = p\nu_H = 4(5) = 20, \qquad \text{df}_2 = wt - \tfrac{1}{2}(p\nu_H - 2) = 130.3.$$

Then the approximate $F$ is given by (6.15),

$$F = \frac{1 - \Lambda^{1/t}}{\Lambda^{1/t}}\,\frac{\text{df}_2}{\text{df}_1} = \frac{1 - (.154)^{1/3.3166}}{(.154)^{1/3.3166}}\,\frac{130.3}{20} = 4.937,$$

which exceeds $F_{.001,20,120} = 2.53$, and we reject $H_0$.

The four eigenvalues of $\mathbf{E}^{-1}\mathbf{H}$ are 1.876, .791, .229, and .026. With these we can calculate the other three test statistics. For Pillai's statistic we have, by (6.24),

$$V^{(s)} = \sum_{i=1}^4 \frac{\lambda_i}{1 + \lambda_i} = 1.305.$$

To find a critical value for $V^{(s)}$ in Table A.11, we need

$$s = \min(\nu_H, p) = 4, \qquad m = \tfrac{1}{2}(|\nu_H - p| - 1) = 0, \qquad N = \tfrac{1}{2}(\nu_E - p - 1) = 18.5.$$

Then $V^{(s)}_{.05} = .645$ (by interpolation). Since $1.305 > .645$, we reject $H_0$.

For the Lawley-Hotelling statistic we obtain, by (6.28),

$$U^{(s)} = \sum_{i=1}^s \lambda_i = 2.921.$$

To make the test, we calculate the test statistic

$$\frac{\nu_E}{\nu_H}\,U^{(s)} = \frac{42}{5}(2.921) = 24.539.$$

The .05 critical value for $\nu_E U^{(s)}/\nu_H$ is given in Table A.12 as 7.6188 (using $\nu_E = 40$), and we therefore reject $H_0$.

Roy's test statistic is given by (6.20) as

$$\theta = \frac{\lambda_1}{1 + \lambda_1} = \frac{1.876}{1 + 1.876} = .652,$$

which exceeds the .05 critical value .377 obtained (by interpolation) from Table A.10, and we reject $H_0$. □
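The calculations in this example can be retraced from the printed matrices. The sketch below assumes NumPy and enters H and E exactly as printed above. Because those entries are rounded to three decimals, the eigenvalues and statistics it produces may differ somewhat from the values quoted in the example, so no expected output is shown.

```python
import numpy as np

# H and E as printed in Example 6.1.7 (rounded to three decimals).
H = np.array([[0.074, 0.537, 0.332, 0.208],
              [0.537, 4.200, 2.355, 1.637],
              [0.332, 2.355, 6.114, 3.781],
              [0.208, 1.637, 3.781, 2.493]])
E = np.array([[0.320, 1.697, 0.554, 0.217],
              [1.697, 12.143, 4.364, 2.110],
              [0.554, 4.364, 4.291, 2.482],
              [0.217, 2.110, 2.482, 1.723]])

lam = np.linalg.eigvals(np.linalg.solve(E, H))        # eigenvalues of E^{-1}H
lam_sorted = np.sort(lam.real)[::-1]                  # largest first

wilks_det = np.linalg.det(E) / np.linalg.det(E + H)   # (6.13)
wilks_eig = float(np.real(np.prod(1 / (1 + lam))))    # (6.14)
pillai = float(np.sum(lam_sorted / (1 + lam_sorted))) # (6.24)
lawley_hotelling = float(np.sum(lam_sorted))          # (6.28)
roy = float(lam_sorted[0] / (1 + lam_sorted[0]))      # (6.20)

print("eigenvalues:", np.round(lam_sorted, 3))
print("Wilks (determinants):", round(wilks_det, 3))
print("Wilks (eigenvalues): ", round(wilks_eig, 3))
print("Pillai:", round(pillai, 3))
print("Lawley-Hotelling:", round(lawley_hotelling, 3))
print("Roy:", round(roy, 3))
```

The two versions of Wilks' Λ agree with each other regardless of rounding in the inputs, since (6.14) is an algebraic identity for (6.13).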

6.1.8  Measures  of  Multivariate  Association 

In multiple regression, a measure of association between the dependent variable $y$ and the independent variables $x_1, x_2, \ldots, x_q$ is given by the squared multiple correlation

$$R^2 = \frac{\text{regression sum of squares}}{\text{total sum of squares}}. \tag{6.40}$$

Similarly, in one-way univariate ANOVA, Fisher's correlation ratio $\eta^2$ is defined as

$$\eta^2 = \frac{\text{between sum of squares}}{\text{total sum of squares}}.$$

This is a measure of model fit similar to $R^2$ and gives the proportion of variation in the dependent variable $y$ attributable to differences among the means of the groups. It answers the question, How well can we predict $y$ by knowing what group it is from? Thus $\eta^2$ can be considered to be a measure of association between the dependent variable $y$ and the grouping variable $i$ associated with $\mu_i$ or $\alpha_i$ in the model (6.2). In fact, if the grouping variable is represented by $k - 1$ dummy variables (also called indicator, or categorical, variables), then we have a dependent variable related to several independent variables as in multiple regression.

A dummy variable takes on the value 1 for sampling units in a group (sample) and 0 for all other sampling units. (Values other than 0 and 1 could be used.) Thus for $k$ samples (groups), the $k - 1$ dummy variables are

$$x_i = \begin{cases} 1 & \text{if sampling unit is in } i\text{th group,} \\ 0 & \text{otherwise,} \end{cases} \qquad i = 1, 2, \ldots, k - 1.$$

Only $k - 1$ dummy variables are needed because if $x_1 = x_2 = \cdots = x_{k-1} = 0$, the sampling unit must be from the $k$th group (see Section 11.6.2 for an illustration). The dependent variable $y$ can be regressed on the $k - 1$ dummy variables $x_1, x_2, \ldots, x_{k-1}$ to produce results equivalent to the usual ANOVA calculations.
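The equivalence between regression on dummy variables and one-way ANOVA can be checked with a small sketch. It assumes NumPy; the function names `eta_squared` and `r_squared_dummy` and the three tiny groups are invented for illustration.

```python
import numpy as np

def eta_squared(groups):
    """Correlation ratio: between sum of squares / total sum of squares."""
    y = np.concatenate(groups)
    grand = y.mean()
    between = sum(len(g) * (np.mean(g) - grand) ** 2 for g in groups)
    total = ((y - grand) ** 2).sum()
    return between / total

def r_squared_dummy(groups):
    """R^2 from regressing y on an intercept and k - 1 dummy variables."""
    k = len(groups)
    y = np.concatenate(groups)
    labels = np.concatenate([np.full(len(g), i) for i, g in enumerate(groups)])
    dummies = [(labels == i).astype(float) for i in range(k - 1)]
    X = np.column_stack([np.ones(len(y))] + dummies)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    total = ((y - y.mean()) ** 2).sum()
    return 1 - (resid ** 2).sum() / total

groups = [np.array([1.0, 2.0, 3.0]),
          np.array([2.0, 3.0, 4.0]),
          np.array([3.0, 4.0, 5.0])]
print(eta_squared(groups), r_squared_dummy(groups))
```

Both functions return the same proportion because the fitted values of the dummy-variable regression are the group means.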

In (one-way) MANOVA, we need to measure the strength of the association between several dependent variables and several independent (grouping) variables. Various measurements of multivariate association have been proposed. Wilks (1932) suggested a "generalized $\eta^2$":

$$\text{MANOVA } \eta^2 = \eta^2_\Lambda = 1 - \Lambda, \tag{6.41}$$

based on the use of $|\mathbf{E}|$ and $|\mathbf{E} + \mathbf{H}|$ as generalizations of sums of squares. We use $1 - \Lambda$ because $\Lambda$ is small if the spread in the means is large.

We now consider an $\eta^2$ based on Roy's statistic, $\theta$. We noted in Section 6.1.4 that the discriminant function is the linear function $z = \mathbf{a}_1'\mathbf{y}$ that maximizes the spread of the means $\bar{z}_{i.} = \mathbf{a}_1'\bar{\mathbf{y}}_{i.}$, $i = 1, 2, \ldots, k$, where $\mathbf{a}_1$ is the eigenvector of $\mathbf{E}^{-1}\mathbf{H}$ corresponding to the largest eigenvalue $\lambda_1$. We measure the spread among the means by $\text{SSH} = n\sum_{i=1}^k (\bar{z}_{i.} - \bar{z}_{..})^2$, divided by the within-sample spread $\text{SSE} = \sum_{ij} (z_{ij} - \bar{z}_{i.})^2$. The maximum value of this ratio is given by $\lambda_1$ [see (6.19)]. Thus

$$\lambda_1 = \frac{\text{SSH}(z)}{\text{SSE}(z)},$$

and by (6.20),

$$\theta = \frac{\lambda_1}{1 + \lambda_1} = \frac{\text{SSH}(z)}{\text{SSE}(z) + \text{SSH}(z)}. \tag{6.42}$$

Hence $\theta$ serves directly as a measure of multivariate association:

$$\eta^2_\theta = \theta = \frac{\lambda_1}{1 + \lambda_1}. \tag{6.43}$$

It can be shown that the square root of this quantity,

$$\eta_\theta = \sqrt{\frac{\lambda_1}{1 + \lambda_1}}, \tag{6.44}$$

is the maximum correlation between a linear combination of the $p$ dependent variables and a linear combination of the $k - 1$ dummy group variables (see Section 11.6.2). This type of correlation is often called a canonical correlation (see Chapter 11) and is defined for each eigenvalue $\lambda_1, \lambda_2, \ldots, \lambda_s$ as $r_i = \sqrt{\lambda_i/(1 + \lambda_i)}$.

We now consider some measures of multivariate association suggested by Cramer and Nicewander (1979) and Muller and Peterson (1984). It is easily shown (Section 11.6.2) that $\Lambda$ can be expressed as

$$\Lambda = \prod_{i=1}^s \frac{1}{1 + \lambda_i} = \prod_{i=1}^s (1 - r_i^2), \tag{6.45}$$

where $r_i^2 = \lambda_i/(1 + \lambda_i)$ is the $i$th squared canonical correlation described earlier. The geometric mean of a set of positive numbers $a_1, a_2, \ldots, a_n$ is defined as $(a_1 a_2 \cdots a_n)^{1/n}$. Thus $\Lambda^{1/s}$ is the geometric mean of the $(1 - r_i^2)$'s, and another measure of multivariate association based on $\Lambda$, in addition to that in (6.41), is

$$A_\Lambda = 1 - \Lambda^{1/s}. \tag{6.46}$$

In fact, as noted by Muller and Peterson, the $F$-approximation given in (6.15),

$$F = \frac{1 - \Lambda^{1/t}}{\Lambda^{1/t}}\,\frac{\text{df}_2}{\text{df}_1},$$

is very similar to the univariate $F$-statistic (10.31) for testing significance in multiple regression,

$$F = \frac{R^2/(\text{df model})}{(1 - R^2)/(\text{df error})}, \tag{6.47}$$

based on $R^2$ in (6.40).

Pillai's statistic is easily expressible as the sum of the squared canonical correlations:

$$V^{(s)} = \sum_{i=1}^s \frac{\lambda_i}{1 + \lambda_i} = \sum_{i=1}^s r_i^2, \tag{6.48}$$

and the average of the $r_i^2$ can be used as a measure of multivariate association:

$$A_P = \frac{\sum_{i=1}^s r_i^2}{s} = \frac{V^{(s)}}{s}. \tag{6.49}$$

In terms of $A_P$ the $F$-approximation given in (6.26) becomes

$$F_2 = \frac{A_P/p\nu_H}{(1 - A_P)/s(\nu_E - \nu_H + s)}, \tag{6.50}$$

which has an obvious parallel to (6.47).

For the Lawley-Hotelling statistic $U^{(s)}$, a multivariate measure of association can be defined as

$$A_{\text{LH}} = \frac{U^{(s)}/s}{1 + U^{(s)}/s}. \tag{6.51}$$

If $s = 1$, (6.51) reduces to (6.43). In fact, (6.43) is a special case of (6.51) because $U^{(s)}/s = \sum_{i=1}^s \lambda_i/s$ is the arithmetic average of the $\lambda_i$'s. It is easily seen that the $F$-approximation $F_3$ for the Lawley-Hotelling statistic given in (6.31) can be expressed in terms of $A_{\text{LH}}$ as

$$F_3 = \frac{A_{\text{LH}}/p\nu_H}{(1 - A_{\text{LH}})/[s(\nu_E - \nu_H - 1) + 2]}, \tag{6.52}$$

which resembles (6.47).
Example 6.1.8. We illustrate some measures of association for the rootstock data
in Table 6.2:

\[ \eta_\Lambda^2 = 1 - \Lambda = .846, \]
\[ \eta_\theta^2 = \theta = .652, \]
\[ A_\Lambda = 1 - \Lambda^{1/4} = 1 - (.154)^{1/4} = .374, \]
\[ A_P = V^{(s)}/s = 1.305/4 = .326, \]
\[ A_{LH} = \frac{U^{(s)}/s}{1 + U^{(s)}/s} = \frac{2.921/4}{1 + 2.921/4} = .422. \]

There  is  a  wide  range  of  values  among  these  measures  of  association. 


□ 
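The measures in Example 6.1.8 can be reproduced from the eigenvalues of $\mathbf{E}^{-1}\mathbf{H}$ alone. The following is a minimal sketch, assuming the four rootstock eigenvalues quoted in Example 6.2 (1.876, .791, .229, .026); the function name `association_measures` and the dictionary keys are illustrative choices, not notation from the text.

```python
# Eigenvalues of E^{-1}H for the rootstock data, as quoted in Example 6.2.
eigenvalues = [1.876, 0.791, 0.229, 0.026]


def association_measures(lams):
    """Association measures based on the four MANOVA statistics.

    Uses (6.45) for Wilks' Lambda, (6.48) for Pillai's V, the sum of the
    eigenvalues for the Lawley-Hotelling U, and the largest eigenvalue
    for Roy's theta.
    """
    s = len(lams)
    wilks = 1.0
    for lam in lams:
        wilks *= 1.0 / (1.0 + lam)                  # product form (6.45)
    pillai = sum(lam / (1.0 + lam) for lam in lams)  # sum of r_i^2, (6.48)
    lawley_hotelling = sum(lams)                     # U^(s)
    theta = max(lams) / (1.0 + max(lams))            # largest r_i^2
    return {
        "eta2_wilks": 1.0 - wilks,                   # (6.41)
        "eta2_theta": theta,                         # (6.43)
        "A_wilks": 1.0 - wilks ** (1.0 / s),         # (6.46)
        "A_pillai": pillai / s,                      # (6.49)
        "A_lh": (lawley_hotelling / s) / (1.0 + lawley_hotelling / s),  # (6.51)
    }


measures = association_measures(eigenvalues)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

Because the eigenvalues are rounded to three decimals, the results agree with the values in the example only to about three decimal places.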


6.2  COMPARISON  OF  THE  FOUR  MANOVA  TEST  STATISTICS 

When $H_0\colon \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2 = \cdots = \boldsymbol{\mu}_k$ is true, all the mean vectors are at the same
point. Therefore, all four MANOVA test statistics have the same Type I error rate,
$\alpha$, as noted in Section 6.1.7; that is, all have the same probability of rejection when
$H_0$ is true. However, when $H_0$ is false, the four tests have different probabilities of
rejection. We noted in Section 6.1.7 that in a given sample the four tests need not
agree, even if $H_0$ is true. One test could reject $H_0$ and the others accept $H_0$, for
example.

Historically, Wilks' $\Lambda$ has played the dominant role in significance tests in
MANOVA because it was the first to be derived and has well-known $\chi^2$- and
F-approximations. It can also be partitioned in certain ways we will find useful later.
However, it is not always the most powerful among the four tests. The probability of
rejecting $H_0$ when it is false is known as the power of the test.

In univariate ANOVA with $p = 1$, the means $\mu_1, \mu_2, \ldots, \mu_k$ can be uniquely
ordered along a line in one dimension, and the usual F-test is uniformly most powerful.
In the multivariate case, on the other hand, with $p > 1$, the mean vectors
are points in $s = \min(p, \nu_H)$ dimensions. We have four tests, not one of which is
uniformly most powerful. The relative powers of the four test statistics depend on
the configuration of the mean vectors $\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \ldots, \boldsymbol{\mu}_k$ in the $s$-dimensional space. A
given test will be more powerful for one configuration of mean vectors than another.

If $\nu_H < p$, then $s = \nu_H$ and the mean vectors lie in an $s$-dimensional subspace
of the $p$-dimensional space of the observations. The points may, in fact, occupy a
subspace of the $s$ dimensions. For example, they may be confined to a line (one
dimension) or a plane (two dimensions). This is illustrated in Figure 6.2.

An indication of the pattern of the mean vectors is given by the eigenvalues of
$\mathbf{E}^{-1}\mathbf{H}$. If there is one large eigenvalue and the others are small, the mean vectors lie
close to a line. If there are two large eigenvalues, the mean vectors lie mostly in two
dimensions, and so on.
Figure 6.2. Two possible configurations for three mean vectors in 3-space.


Because Roy's test uses only the largest eigenvalue of $\mathbf{E}^{-1}\mathbf{H}$, it is more powerful
than the others if the mean vectors are collinear. The other three tests have greater
power than Roy's when the mean vectors are diffuse (spread out in several dimensions).

In terms of power, the tests are ordered $\theta \ge U^{(s)} \ge \Lambda \ge V^{(s)}$ for the collinear
case. In the diffuse case and for intermediate structure between collinear and diffuse,
the ordering of power is reversed, $V^{(s)} \ge \Lambda \ge U^{(s)} \ge \theta$. The latter ordering also
holds for accuracy of the Type I error rate when the population covariance matrices
$\boldsymbol{\Sigma}_1, \boldsymbol{\Sigma}_2, \ldots, \boldsymbol{\Sigma}_k$ are not equal. These orderings are comparisons of power. For
actual computation of power in a given experimental setting or to find the sample
size needed to yield a desired level of power, see Rencher (1998, Section 4.4).

Generally, if group sizes are equal, the tests are sufficiently robust with respect
to heterogeneity of covariance matrices so that we need not worry. If the $n_i$'s are
unequal and we have heterogeneity, then the $\alpha$-level of the MANOVA test may be
affected as follows. If the larger variances and covariances are associated with the
larger samples, the true $\alpha$-level is reduced and the tests become conservative. On the
other hand, if the larger variances and covariances come from the smaller samples,
$\alpha$ is inflated, and the tests become liberal. Box's M-test in Section 7.3.2 can be used
to test for homogeneity of covariance matrices.

In conclusion, the use of Roy's $\theta$ is not recommended in any situation except
the collinear case under standard assumptions. In the diffuse case its performance is
inferior to that of the other three, both when the assumptions hold and when they do
not. If the data come from nonnormal populations exhibiting skewness or positive
kurtosis, any of the other three tests performs acceptably well. Among these three,
$V^{(s)}$ is superior to the other two when there is heterogeneity of covariance matrices.
Indeed $V^{(s)}$ is first in all rankings except those for the collinear case. However, $\Lambda$
is not far behind, except when there is severe heterogeneity of covariance matrices.
It seems likely that Wilks' $\Lambda$ will continue its dominant role because of its flexibility
and historical precedence. [For references for this section, see Rencher (1998,
Section 4.2).]

In  practice,  most  MANOVA  software  programs  routinely  calculate  all  four  test 
statistics,  and  they  usually  reach  the  same  conclusion.  In  those  cases  when  they  differ 
as  to  acceptance  or  rejection  of  the  hypothesis,  one  can  examine  the  eigenvalues 
and  covariance  matrices  and  evaluate  the  conflicting  conclusions  in  light  of  the  test 
properties  discussed  previously. 

Example 6.2. We inspect the eigenvalues of $\mathbf{E}^{-1}\mathbf{H}$ for the rootstock data of
Table 6.2 for an indication of the configuration of the six mean vectors in a
four-dimensional space. The eigenvalues are 1.876, .791, .229, .026. The first eigenvalue,
1.876, constitutes a proportion

\[ \frac{1.876}{1.876 + .791 + .229 + .026} = .642 \]

of the sum of the eigenvalues. Therefore, the first eigenvalue does not dominate the
others, and the mean vectors are not collinear. The first two eigenvalues account for
a proportion

\[ \frac{1.876 + .791}{1.876 + \cdots + .026} = .913 \]

of the sum of the eigenvalues, and thus the six mean vectors lie largely in two dimensions.
Since the mean vectors are not collinear, the test statistics $\Lambda$, $V^{(s)}$, and $U^{(s)}$
will be more appropriate than $\theta$ in this case. □
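The proportions in Example 6.2 are cumulative shares of the eigenvalue sum. A short sketch of that arithmetic follows; the helper name `cumulative_proportions` is hypothetical, and the eigenvalues are the rounded values quoted in the example.

```python
def cumulative_proportions(eigenvalues):
    """Cumulative share of the eigenvalue sum, largest eigenvalue first."""
    total = sum(eigenvalues)
    running = 0.0
    shares = []
    for lam in sorted(eigenvalues, reverse=True):
        running += lam
        shares.append(running / total)
    return shares


# Eigenvalues of E^{-1}H for the rootstock data (Example 6.2).
props = cumulative_proportions([1.876, 0.791, 0.229, 0.026])
print([round(p, 3) for p in props])
```

A first share near 1 would suggest nearly collinear mean vectors; here the first share is about .64 and the first two together about .91, which matches the reading in the example that the mean vectors lie largely in two dimensions.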


6.3  CONTRASTS 

As in Sections 6.1.1-6.1.5, we consider only the balanced model where $n_1 = n_2 = \cdots = n_k = n$.
We begin with a review of contrasts in the univariate setting before
moving to the multivariate case.

6.3.1  Univariate  Contrasts 

A contrast in the population means is defined as a linear combination

\[ \delta = c_1\mu_1 + c_2\mu_2 + \cdots + c_k\mu_k, \tag{6.53} \]

where the coefficients satisfy

\[ \sum_{i=1}^{k} c_i = 0. \tag{6.54} \]

An unbiased estimator of $\delta$ is given by

\[ \hat{\delta} = c_1\bar{y}_{1.} + c_2\bar{y}_{2.} + \cdots + c_k\bar{y}_{k.}. \tag{6.55} \]
The sample means $\bar{y}_{i.}$ were defined in (6.1). Since the $\bar{y}_{i.}$'s are independent with
variance $\sigma^2/n$, the variance of $\hat{\delta}$ is

\[ \mathrm{var}(\hat{\delta}) = \frac{\sigma^2}{n} \sum_{i=1}^{k} c_i^2, \]

which can be estimated by

\[ s_{\hat{\delta}}^2 = \frac{\mathrm{MSE}}{n} \sum_{i=1}^{k} c_i^2, \tag{6.56} \]

where MSE was defined in (6.6) and (6.7) as $\mathrm{SSE}/k(n - 1)$.

The usual hypothesis to be tested by a contrast is

\[ H_0\colon \delta = c_1\mu_1 + c_2\mu_2 + \cdots + c_k\mu_k = 0. \]

For example, suppose $k = 4$ and that a contrast of interest to the researcher is
$3\mu_1 - \mu_2 - \mu_3 - \mu_4$. If this contrast is set equal to zero, we have

\[ 3\mu_1 = \mu_2 + \mu_3 + \mu_4 \quad \text{or} \quad \mu_1 = \tfrac{1}{3}(\mu_2 + \mu_3 + \mu_4), \]

and the experimenter is comparing the first mean with the average of the other three.
A contrast is often called a comparison among the treatment means.

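Assuming normality, $H_0\colon \delta = 0$ can be tested by

\[ t = \frac{\sum_{i=1}^{k} c_i\bar{y}_{i.}}{\sqrt{\mathrm{MSE}\sum_{i=1}^{k} c_i^2/n}}, \tag{6.57} \]

which is distributed as $t_{\nu_E}$. Alternatively, since $t^2 = F$, we can use

\[ F = \frac{\left(\sum_{i=1}^{k} c_i\bar{y}_{i.}\right)^2}{\mathrm{MSE}\sum_{i=1}^{k} c_i^2/n}
     = \frac{n\left(\sum_{i} c_i\bar{y}_{i.}\right)^2 / \sum_{i} c_i^2}{\mathrm{MSE}}, \tag{6.58} \]

which is distributed as $F_{1,\nu_E}$. The numerator of (6.58) is often referred to as the sum
of squares for the contrast.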
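The test in (6.57) and (6.58) is easy to carry out by hand for a small balanced layout. The sketch below assumes made-up data (four groups of three observations) and a hypothetical helper `contrast_tests`; it is meant only to show that the two statistics are related by $t^2 = F$.

```python
import math


def contrast_tests(groups, coeffs):
    """t of (6.57) and F of (6.58) for a contrast in a balanced one-way layout."""
    k = len(groups)
    n = len(groups[0])
    if any(len(g) != n for g in groups):
        raise ValueError("balanced data required")
    if abs(sum(coeffs)) > 1e-12:
        raise ValueError("contrast coefficients must sum to zero")
    means = [sum(g) / n for g in groups]
    # Error sum of squares and its degrees of freedom, nu_E = k(n - 1).
    sse = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    nu_e = k * (n - 1)
    mse = sse / nu_e
    estimate = sum(c * m for c, m in zip(coeffs, means))   # estimated contrast
    sum_c2 = sum(c * c for c in coeffs)
    t = estimate / math.sqrt(mse * sum_c2 / n)             # (6.57)
    f = (n * estimate ** 2 / sum_c2) / mse                 # (6.58)
    return t, f, nu_e


# Hypothetical data: k = 4 groups, n = 3 observations per group.
groups = [[10.0, 12.0, 11.0], [8.0, 9.0, 7.0], [9.0, 8.0, 10.0], [7.0, 9.0, 8.0]]
# Compare the first mean with the average of the other three.
t, f, nu_e = contrast_tests(groups, [3, -1, -1, -1])
print(t, f, nu_e)
```

For these illustrative data the group means are 11, 8, 9, 8 and MSE = 1, so the estimated contrast is 8, $t = 4$, and $F = 16$ on 1 and 8 degrees of freedom.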

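If two contrasts $\delta = \sum_i a_i\mu_i$ and $\gamma = \sum_i b_i\mu_i$ are such that $\sum_i a_i b_i = 0$, the
contrasts are said to be orthogonal. The two estimated contrasts can be written in
the form $\sum_i a_i\bar{y}_{i.} = \mathbf{a}'\bar{\mathbf{y}}$ and $\sum_i b_i\bar{y}_{i.} = \mathbf{b}'\bar{\mathbf{y}}$, where $\bar{\mathbf{y}} = (\bar{y}_{1.}, \bar{y}_{2.}, \ldots, \bar{y}_{k.})'$. Then
$\sum_i a_i b_i = \mathbf{a}'\mathbf{b} = 0$, and by the discussion following (3.14), the coefficient vectors $\mathbf{a}$
and $\mathbf{b}$ are perpendicular.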
When two contrasts are orthogonal, the two corresponding sums of squares are
independent. In fact, for $k$ treatments, we can find $k - 1$ orthogonal contrasts that
partition the treatment sum of squares SSH into $k - 1$ independent sums of squares,
each with one degree of freedom. In the unbalanced case (Section 6.1.6), orthogonal
contrasts such that $\sum_i a_i b_i = 0$ do not partition SSH into $k - 1$ independent sums
of squares. For a discussion of contrasts in the unbalanced case, see Rencher (1998,
Sections 4.8.2 and 4.8.3; 2000, Section 14.2.2).
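The partition of SSH by orthogonal contrasts can be checked numerically in the balanced case. The following sketch assumes hypothetical group means (11, 8, 9, 8 with $n = 3$) and one arbitrary choice of $k - 1 = 3$ mutually orthogonal contrasts; the function name `contrast_ss` is illustrative.

```python
def contrast_ss(means, n, coeffs):
    """Sum of squares for a contrast: the numerator of (6.58)."""
    estimate = sum(c * m for c, m in zip(coeffs, means))
    return n * estimate ** 2 / sum(c * c for c in coeffs)


# Hypothetical balanced layout: k = 4 group means, n = 3 observations each.
means = [11.0, 8.0, 9.0, 8.0]
n = 3

# One possible set of k - 1 = 3 mutually orthogonal contrasts.
contrasts = [[3, -1, -1, -1], [0, 2, -1, -1], [0, 0, 1, -1]]

# Confirm pairwise orthogonality: sum_i a_i b_i = 0 for every pair.
for i in range(len(contrasts)):
    for j in range(i + 1, len(contrasts)):
        assert sum(a * b for a, b in zip(contrasts[i], contrasts[j])) == 0

# Treatment sum of squares, SSH = n * sum_i (ybar_i. - ybar_..)^2.
grand = sum(means) / len(means)
ssh = n * sum((m - grand) ** 2 for m in means)

pieces = [contrast_ss(means, n, c) for c in contrasts]
print(pieces, sum(pieces), ssh)
```

Here the three contrast sums of squares are 16, 0.5, and 1.5, which add to SSH = 18. With unequal group sizes the same coefficient vectors would generally not reproduce SSH, as the text notes.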


6.3.2  Multivariate  Contrasts 

There are two usages of contrasts in a multivariate setting. We have previously
encountered one use in Section 5.9.1, where we considered the hypothesis $H_0\colon \mathbf{C}\boldsymbol{\mu} = \mathbf{0}$
with $\mathbf{C}\mathbf{j} = \mathbf{0}$. Each row of $\mathbf{C}$ sums to zero, and $\mathbf{C}\boldsymbol{\mu}$ is therefore a set of contrasts
comparing the elements $\mu_1, \mu_2, \ldots, \mu_p$ of $\boldsymbol{\mu}$ with each other. In this section, on the
other hand, we consider contrasts comparing several mean vectors, not the elements
within a vector.

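A contrast among the population mean vectors is defined as

\[ \boldsymbol{\delta} = c_1\boldsymbol{\mu}_1 + c_2\boldsymbol{\mu}_2 + \cdots + c_k\boldsymbol{\mu}_k, \tag{6.59} \]

where $\sum_{i=1}^{k} c_i = 0$. An unbiased estimator of $\boldsymbol{\delta}$ is given by the corresponding
contrast in the sample mean vectors:

\[ \hat{\boldsymbol{\delta}} = c_1\bar{\mathbf{y}}_{1.} + c_2\bar{\mathbf{y}}_{2.} + \cdots + c_k\bar{\mathbf{y}}_{k.}. \tag{6.60} \]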


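The sample mean vectors $\bar{\mathbf{y}}_{1.}, \bar{\mathbf{y}}_{2.}, \ldots, \bar{\mathbf{y}}_{k.}$ as defined in Section 6.1.2 were assumed
to be independent and to have common covariance matrix, $\mathrm{cov}(\bar{\mathbf{y}}_{i.}) = \boldsymbol{\Sigma}/n$. Thus the
covariance matrix for $\hat{\boldsymbol{\delta}}$ is given by

\[ \mathrm{cov}(\hat{\boldsymbol{\delta}}) = c_1^2\frac{\boldsymbol{\Sigma}}{n} + c_2^2\frac{\boldsymbol{\Sigma}}{n} + \cdots + c_k^2\frac{\boldsymbol{\Sigma}}{n}
   = \frac{\boldsymbol{\Sigma}}{n}\sum_{i=1}^{k} c_i^2, \]

which can be estimated by

\[ \frac{\mathbf{S}_{pl}}{n}\sum_{i=1}^{k} c_i^2, \tag{6.61} \]

where $\mathbf{S}_{pl} = \mathbf{E}/\nu_E$ is an unbiased estimator of $\boldsymbol{\Sigma}$.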

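The hypothesis $H_0\colon \boldsymbol{\delta} = \mathbf{0}$ or $H_0\colon c_1\boldsymbol{\mu}_1 + c_2\boldsymbol{\mu}_2 + \cdots + c_k\boldsymbol{\mu}_k = \mathbf{0}$ makes
comparisons among the population mean vectors. For example, $\boldsymbol{\mu}_1 - 2\boldsymbol{\mu}_2 + \boldsymbol{\mu}_3 = \mathbf{0}$ is
equivalent to

\[ \boldsymbol{\mu}_2 = \tfrac{1}{2}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_3), \]

and we are comparing $\boldsymbol{\mu}_2$ to the average of $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_3$. Of course this implies that
every element of $\boldsymbol{\mu}_2$ must equal the corresponding element of $\tfrac{1}{2}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_3)$:

\[ \begin{pmatrix} \mu_{21} \\ \mu_{22} \\ \vdots \\ \mu_{2p} \end{pmatrix}
 = \begin{pmatrix} \tfrac{1}{2}(\mu_{11} + \mu_{31}) \\ \tfrac{1}{2}(\mu_{12} + \mu_{32}) \\ \vdots \\ \tfrac{1}{2}(\mu_{1p} + \mu_{3p}) \end{pmatrix}. \]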

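Under appropriate multivariate normality assumptions, $H_0\colon c_1\boldsymbol{\mu}_1 + c_2\boldsymbol{\mu}_2 + \cdots + c_k\boldsymbol{\mu}_k = \mathbf{0}$
or $H_0\colon \boldsymbol{\delta} = \mathbf{0}$ can be tested with the one-sample $T^2$-statistic

\[ T^2 = \hat{\boldsymbol{\delta}}'\left(\frac{\mathbf{S}_{pl}}{n}\sum_{i=1}^{k} c_i^2\right)^{-1}\hat{\boldsymbol{\delta}}
       = \frac{n}{\sum_{i=1}^{k} c_i^2}\left(\sum_{i=1}^{k} c_i\bar{\mathbf{y}}_{i.}\right)'\mathbf{S}_{pl}^{-1}\left(\sum_{i=1}^{k} c_i\bar{\mathbf{y}}_{i.}\right), \tag{6.62} \]

which is distributed as $T^2_{p,\nu_E}$. In the one-way model under discussion here,
$\nu_E = k(n - 1)$.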

An equivalent test of $H_0$ can be made with Wilks' $\Lambda$. By analogy with the numerator
of (6.58), the hypothesis matrix due to the contrast is given by

\[ \mathbf{H}_1 = \frac{n}{\sum_{i=1}^{k} c_i^2}\left(\sum_{i=1}^{k} c_i\bar{\mathbf{y}}_{i.}\right)\left(\sum_{i=1}^{k} c_i\bar{\mathbf{y}}_{i.}\right)'. \tag{6.63} \]

The rank of $\mathbf{H}_1$ is 1, and the test statistic is

\[ \Lambda = \frac{|\mathbf{E}|}{|\mathbf{E} + \mathbf{H}_1|}, \tag{6.64} \]

which is distributed as $\Lambda_{p,1,\nu_E}$. The other three MANOVA test statistics can also
be applied here using the single nonzero eigenvalue of $\mathbf{E}^{-1}\mathbf{H}_1$. Because $\nu_H = 1$
in this case, all four MANOVA statistics and $T^2$ give the same results; that is, all
five transform to the same F-value using the formulations in Section 6.1.7. If $k - 1$
orthogonal contrasts are used, they partition the $\mathbf{H}$ matrix into $k - 1$ independent
matrices $\mathbf{H}_1, \mathbf{H}_2, \ldots, \mathbf{H}_{k-1}$. Each $\mathbf{H}_i$ matrix has one degree of freedom because
$\mathrm{rank}(\mathbf{H}_i) = 1$.


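Example 6.3.2. We consider the following two orthogonal contrasts for the rootstock
data in Table 6.2:

     2  -1  -1  -1  -1   2,
     1   0   0   0   0  -1.

The first compares $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_6$ with the other four mean vectors. The second compares
$\boldsymbol{\mu}_1$ vs. $\boldsymbol{\mu}_6$. Thus $H_{01}\colon 2\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2 - \boldsymbol{\mu}_3 - \boldsymbol{\mu}_4 - \boldsymbol{\mu}_5 + 2\boldsymbol{\mu}_6 = \mathbf{0}$ can be written as

\[ H_{01}\colon 2\boldsymbol{\mu}_1 + 2\boldsymbol{\mu}_6 = \boldsymbol{\mu}_2 + \boldsymbol{\mu}_3 + \boldsymbol{\mu}_4 + \boldsymbol{\mu}_5. \]

Dividing both sides by 4 to express this in terms of averages, we obtain

\[ H_{01}\colon \tfrac{1}{2}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_6) = \tfrac{1}{4}(\boldsymbol{\mu}_2 + \boldsymbol{\mu}_3 + \boldsymbol{\mu}_4 + \boldsymbol{\mu}_5). \]

Similarly, the hypothesis for the second contrast can be expressed as

\[ H_{02}\colon \boldsymbol{\mu}_1 = \boldsymbol{\mu}_6. \]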

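The mean vectors are given by

  $\bar{\mathbf{y}}_{1.}$   $\bar{\mathbf{y}}_{2.}$   $\bar{\mathbf{y}}_{3.}$   $\bar{\mathbf{y}}_{4.}$   $\bar{\mathbf{y}}_{5.}$   $\bar{\mathbf{y}}_{6.}$
  1.14    1.16    1.11    1.10    1.08    1.04
  2.98    3.11    2.82    2.88    2.56    2.21
  3.74    4.52    4.46    3.91    4.31    3.60
   .87    1.28    1.39    1.04    1.18     .74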

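For the first contrast, we obtain $\mathbf{H}_1$ from (6.63) as

\[ \mathbf{H}_1 = \frac{n}{\sum_i c_i^2}\left(\sum_i c_i\bar{\mathbf{y}}_{i.}\right)\left(\sum_i c_i\bar{\mathbf{y}}_{i.}\right)'
   = \frac{8}{12}(2\bar{\mathbf{y}}_{1.} - \bar{\mathbf{y}}_{2.} - \cdots + 2\bar{\mathbf{y}}_{6.})(2\bar{\mathbf{y}}_{1.} - \bar{\mathbf{y}}_{2.} - \cdots + 2\bar{\mathbf{y}}_{6.})' \]

\[ = \frac{8}{12}\begin{pmatrix} -.095 \\ -.978 \\ -2.519 \\ -1.680 \end{pmatrix}(-.095, -.978, -2.519, -1.680)
   = \begin{pmatrix} .006 & .062 & .160 & .106 \\ .062 & .638 & 1.642 & 1.095 \\ .160 & 1.642 & 4.229 & 2.820 \\ .106 & 1.095 & 2.820 & 1.881 \end{pmatrix}. \]

Then

\[ \Lambda = \frac{|\mathbf{E}|}{|\mathbf{E} + \mathbf{H}_1|} = \frac{.6571}{1.4824} = .443, \]

which is less than $\Lambda_{.05,4,1,40} = .779$ from Table A.9. We therefore reject $H_{01}$.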
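To test the significance of the second contrast, we have

\[ \mathbf{H}_2 = \frac{8}{2}(\bar{\mathbf{y}}_{1.} - \bar{\mathbf{y}}_{6.})(\bar{\mathbf{y}}_{1.} - \bar{\mathbf{y}}_{6.})'
   = \frac{8}{2}\begin{pmatrix} .101 \\ .762 \\ .142 \\ .136 \end{pmatrix}(.101, .762, .142, .136)
   = \begin{pmatrix} .041 & .309 & .058 & .055 \\ .309 & 2.326 & .435 & .415 \\ .058 & .435 & .081 & .078 \\ .055 & .415 & .078 & .074 \end{pmatrix}. \]

Then

\[ \Lambda = \frac{|\mathbf{E}|}{|\mathbf{E} + \mathbf{H}_2|} = \frac{.6571}{.8757} = .750, \]

which is less than $\Lambda_{.05,4,1,40} = .779$, and we reject $H_{02}$.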


□ 
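The hypothesis matrices of Example 6.3.2 follow directly from (6.63). The sketch below assumes the two-decimal mean vectors printed in the example and $n = 8$; because the printed means are rounded, the computed entries differ slightly from the matrices shown above, which were presumably obtained from unrounded means. The determinants used for Wilks' $\Lambda$ are simply the values quoted in the example, since $\mathbf{E}$ itself is not reproduced here. The function names are illustrative.

```python
def contrast_vector(mean_vectors, coeffs):
    """Estimated contrast: sum_i c_i * ybar_i., computed element by element."""
    p = len(mean_vectors[0])
    return [sum(c * m[r] for c, m in zip(coeffs, mean_vectors)) for r in range(p)]


def hypothesis_matrix(mean_vectors, coeffs, n):
    """Rank-one hypothesis matrix of (6.63) for a balanced one-way layout."""
    d = contrast_vector(mean_vectors, coeffs)
    scale = n / sum(c * c for c in coeffs)
    return [[scale * di * dj for dj in d] for di in d]


# Rootstock mean vectors as printed (rounded to two decimals) in Example 6.3.2.
means = [
    [1.14, 2.98, 3.74, 0.87],
    [1.16, 3.11, 4.52, 1.28],
    [1.11, 2.82, 4.46, 1.39],
    [1.10, 2.88, 3.91, 1.04],
    [1.08, 2.56, 4.31, 1.18],
    [1.04, 2.21, 3.60, 0.74],
]
n = 8  # observations per rootstock

H1 = hypothesis_matrix(means, [2, -1, -1, -1, -1, 2], n)
H2 = hypothesis_matrix(means, [1, 0, 0, 0, 0, -1], n)

# Wilks' Lambda from the determinants quoted in the example.
wilks_1 = 0.6571 / 1.4824
wilks_2 = 0.6571 / 0.8757

for row in H1:
    print([round(x, 3) for x in row])
print(round(wilks_1, 3), round(wilks_2, 3))
```

Both values of $\Lambda$ fall below the critical value .779, in agreement with the rejections reported in the example.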


6.4 TESTS ON INDIVIDUAL VARIABLES FOLLOWING REJECTION
OF $H_0$ BY THE OVERALL MANOVA TEST

In Section 6.1, we considered tests of equality of mean vectors, $H_0\colon \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2 = \cdots = \boldsymbol{\mu}_k$,
which implies equality of means for each of the $p$ variables:

\[ H_{0r}\colon \mu_{1r} = \mu_{2r} = \cdots = \mu_{kr}, \quad r = 1, 2, \ldots, p. \]

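This hypothesis could be tested for each variable by itself with an ordinary univariate
ANOVA F-test, as noted in property 9 in Section 6.1.3. For example, if there are
three mean vectors,

\[ \boldsymbol{\mu}_1 = \begin{pmatrix} \mu_{11} \\ \mu_{12} \\ \vdots \\ \mu_{1p} \end{pmatrix}, \quad
   \boldsymbol{\mu}_2 = \begin{pmatrix} \mu_{21} \\ \mu_{22} \\ \vdots \\ \mu_{2p} \end{pmatrix}, \quad
   \boldsymbol{\mu}_3 = \begin{pmatrix} \mu_{31} \\ \mu_{32} \\ \vdots \\ \mu_{3p} \end{pmatrix}, \]

we have $H_{01}\colon \mu_{11} = \mu_{21} = \mu_{31}$, $H_{02}\colon \mu_{12} = \mu_{22} = \mu_{32}$, $\ldots$, $H_{0p}\colon \mu_{1p} = \mu_{2p} = \mu_{3p}$.
Each of these $p$ hypotheses can be tested with a simple ANOVA F-test.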

If an F-test is made on each of the $p$ variables regardless of whether the overall
MANOVA test of $H_0\colon \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2 = \boldsymbol{\mu}_3$ rejects $H_0$, then the overall $\alpha$-level will
increase beyond the nominal value because we are making $p$ tests. As in Section 5.5,
we define the overall $\alpha$ or experimentwise error rate as the probability of rejecting
one or more of the $p$ univariate tests when $H_0\colon \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2 = \boldsymbol{\mu}_3$ is true. We could
"protect" against inflation of the experimentwise error rate by performing tests on
individual variables only if the overall MANOVA test of $H_0\colon \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2 = \boldsymbol{\mu}_3$ rejects
$H_0$. In this procedure, the probability of rejection for the tests on individual variables
is reduced, and these tests become more conservative.

Rencher and Scott (1990) compared these two procedures for testing the individual
variables in a one-way MANOVA model. Since the focus was on $\alpha$-levels, only
the case where $H_0$ is true was considered. Specifically, the two procedures were as
follows:

1. A univariate F-test is made on each variable, testing $H_{0r}\colon \mu_{1r} = \mu_{2r} = \cdots = \mu_{kr}$,
$r = 1, 2, \ldots, p$. In this context, the $p$ univariate tests constitute an experiment
and one or more rejections are counted as one experimentwise error. No
multivariate test is made.

2. The overall MANOVA hypothesis $H_0\colon \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2 = \cdots = \boldsymbol{\mu}_k$ is tested with
Wilks' $\Lambda$, and if $H_0$ is rejected, $p$ univariate F-tests on $H_{01}, H_{02}, \ldots, H_{0p}$
are carried out. Again, one or more rejections among the F-tests are counted
as one experimentwise error.

The amount of intercorrelation among the multivariate normal variables was indicated
by $\sum_{i=1}^{p}(1/\lambda_i)/p$, where $\lambda_1, \lambda_2, \ldots, \lambda_p$ are the eigenvalues of the population
correlation matrix $\mathbf{P}_\rho$. Note that $\sum_i(1/\lambda_i)/p = 1$ for the uncorrelated case ($\mathbf{P}_\rho = \mathbf{I}$)
and $\sum_i(1/\lambda_i)/p > 1$ for the correlated case ($\mathbf{P}_\rho \ne \mathbf{I}$). When the variables are highly
intercorrelated, one or more of the eigenvalues will be near zero (see Section 4.1.3),
and $\sum_i(1/\lambda_i)/p$ will be large.
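To get a feel for the size of $\sum_i(1/\lambda_i)/p$, it helps to evaluate it for a correlation matrix whose eigenvalues are known in closed form. The sketch below assumes an equicorrelation matrix (every off-diagonal element equal to $\rho$), which is an illustrative choice and not necessarily the structure used in the study; its eigenvalues are $1 + (p - 1)\rho$ once and $1 - \rho$ with multiplicity $p - 1$. The function name `intercorrelation` is hypothetical.

```python
def intercorrelation(p, rho):
    """Average reciprocal eigenvalue, sum_i (1/lambda_i)/p, for a p x p
    equicorrelation matrix with common correlation rho (assumed structure)."""
    if p < 2:
        raise ValueError("p must be at least 2")
    if not -1.0 / (p - 1) < rho < 1.0:
        raise ValueError("rho must keep the matrix positive definite")
    largest = 1.0 + (p - 1) * rho   # one eigenvalue
    others = 1.0 - rho              # p - 1 equal eigenvalues
    return (1.0 / largest + (p - 1) / others) / p


uncorrelated = intercorrelation(3, 0.0)   # identity correlation matrix
moderate = intercorrelation(2, 0.9)       # equals 1 / (1 - 0.9**2) when p = 2
extreme = intercorrelation(5, 0.99)       # one near-zero eigenvalue repeated
print(uncorrelated, round(moderate, 3), round(extreme, 2))
```

The measure equals 1 when $\rho = 0$ and grows quickly as $\rho$ approaches 1, consistent with the remark that near-zero eigenvalues make the measure large.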

The error rates of these two procedures were investigated for several values of
$p$, $n$, $k$, and $\sum_i(1/\lambda_i)/p$, where $p$ is the number of variables, $n$ is the number of
observation vectors in each group, $k$ is the number of groups, and $\sum_i(1/\lambda_i)/p$ is the
measure of intercorrelation defined above. In procedure 1, the probability of rejecting
one or more univariate tests when $H_0$ is true varied from .09 to .31 ($\alpha$ was .05 in each
test). Such experimentwise error rates are clearly unacceptable when the nominal
value of $\alpha$ is .05. However, this approach is commonly used when the researcher
is not familiar with the MANOVA approach or does not have access to appropriate
software.
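A rough benchmark for procedure 1 is the experimentwise rate that would result if the $p$ univariate tests were independent, namely $1 - (1 - \alpha)^p$. This is a simplifying assumption made here for illustration, not the simulation design of the study; with correlated variables the rate is generally smaller, which is in keeping with the reported range of .09 to .31. The function name is hypothetical.

```python
def experimentwise_rate(p, alpha=0.05):
    """P(at least one rejection among p tests at level alpha), assuming the
    p tests are independent and every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** p


rates = {p: experimentwise_rate(p) for p in (3, 5, 7)}
for p, rate in rates.items():
    print(p, round(rate, 3))
```

Under independence the rates for $p = 3, 5, 7$ are about .143, .226, and .302, far above the nominal .05.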

Table 6.3 contains the error rates for procedure 2, univariate F-tests following a
rejection by Wilks' $\Lambda$. The values range from .022 to .057, comfortably close to the
target value of .05. No apparent trends or patterns are seen; the values do not seem
to depend on $p$, $k$, $n$, or the amount of intercorrelation as measured by $\sum_i(1/\lambda_i)/p$.
Thus when univariate tests are made only following a rejection of the overall test, the
experimentwise error rate is about right.

Based on these results, we recommend making an overall MANOVA test followed
by F-tests on the individual variables (at the same $\alpha$-level as the MANOVA test) only
if the MANOVA test leads to rejection of $H_0$.

Table 6.3. Experimentwise Error Rates for Procedure 2: Univariate F-Tests Following
Rejection by Wilks' $\Lambda$

                              $\sum_i(1/\lambda_i)/p$
                 1              10             100            1000
  n    p     k=3    k=5     k=3    k=5     k=3    k=5     k=3    k=5
  5    5    .041   .037    .039   .057    .038   .035    .027   .039
  5    7    .030   .042    .035   .045    .039   .037    .026   .048
 10    3    .047   .041    .030   .033    .043   .045    .026   .032
 10    5    .047   .037    .026   .049    .041   .026    .027   .029
 10    7    .034   .054    .037   .047    .047   .040    .040   .044
 20    3    .050   .043    .032   .054    .048   .039    .040   .032
 20    5    .045   .055    .042   .051    .037   .044    .050   .043
 20    7    .055   .051    .029   .040    .033   .051    .039   .033

For $n = 5$, $p = 3$, only five of the eight entries are recoverable (.037, .022, .039, .022, .029,
in their original order); their column positions are not certain.

Another procedure that can be used following rejection of the MANOVA test
is an examination of the discriminant function coefficients. The discriminant function
was defined in Section 6.1.4 as $z = \mathbf{a}_1'\mathbf{y}$, where $\mathbf{a}_1$ is the eigenvector associated
with the largest eigenvalue $\lambda_1$ of $\mathbf{E}^{-1}\mathbf{H}$. Additionally, there are other discriminant
functions using eigenvectors corresponding to the other eigenvalues. Since
the first discriminant function maximally separates the groups, we can examine its
coefficients for the contribution of each variable to group separation. Thus in
$z = a_{11}y_1 + a_{12}y_2 + \cdots + a_{1p}y_p$, if $a_{12}$ is larger than the other $a_{1r}$'s, we believe $y_2$
contributes more than any of the other variables to separation of groups. A method of
standardization of the $a_{1r}$'s to adjust for differences in the scale among the variables
is given in Section 8.5.

The information in $a_{1r}$ (from $z = \mathbf{a}_1'\mathbf{y}$) about the contribution of $y_r$ to separation
of the groups is fundamentally different from the information provided in a univariate
F-test that considers $y_r$ alone (see property 9 in Section 6.1.3). The relative size of
$a_{1r}$ shows the contribution of $y_r$ in the presence of the other variables and takes
into account (1) the correlation of $y_r$ with the other $y$'s and (2) the contribution
of $y_r$ to Wilks' $\Lambda$ above and beyond the contribution of the other $y$'s. In contrast,
the individual F-test on $y_r$ ignores the presence of the other variables. Because we
are primarily interested in the collective behavior of the variables, the discriminant
function coefficients provide more pertinent information than the tests on individual
variables. For a detailed analysis of the effect of each variable in the presence of
other variables, see Rencher (1993; 1998, Section 4.1.6).

Huberty (1975) compared the standardized coefficients to some correlations that
can be shown to be related to individual variable tests (see Section 8.7.3). In a limited
simulation, the discriminant coefficients were found to be more valid than the
univariate tests in identifying those variables that contribute least to separation of
groups. Considerable variation was found from sample to sample in ranking the relative
potency of the variables.


Example 6.4. In Example 6.1.7, the hypothesis $H_0\colon \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2 = \cdots = \boldsymbol{\mu}_6$ was
rejected for the rootstock data of Table 6.2. We can therefore test the four individual
variables using the .05 level of significance. For the first variable, $y_1$ = 4-year trunk
girth, we obtain the following ANOVA table:

Source        Sum of Squares    df    Mean Square    F
Rootstocks    .073560            5    .014712        1.93
Error         .319988           42    .007619
Total         .393548           47

For $F = 1.93$ the p-value is .1094, and we do not reject $H_0$. For the other three
variables we have

Variable                             F        p-Value
$y_2$ = 4-year extension growth      2.91     .024
$y_3$ = 15-year trunk girth          11.97    < .0001
$y_4$ = 15-year weight               12.16    < .0001

Thus  for  three  of  the  four  variables,  the  six  means  differ  significantly.  We  examine 
the  standardized  discriminant  function  coefficients  for  this  set  of  data  in  Chapter  8 
(Problem  8.12).  □ 
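The F-ratio in the ANOVA table for $y_1$ can be checked from the printed sums of squares and degrees of freedom. The following sketch does only that arithmetic; the helper name `anova_f` is hypothetical, and the p-value is not recomputed because that would require an F-distribution routine outside the standard library.

```python
def anova_f(ss_hypothesis, df_hypothesis, ss_error, df_error):
    """Mean squares and F-ratio from sums of squares and degrees of freedom."""
    ms_hypothesis = ss_hypothesis / df_hypothesis
    ms_error = ss_error / df_error
    return ms_hypothesis, ms_error, ms_hypothesis / ms_error


# Values from the ANOVA table for y1 (4-year trunk girth) in Example 6.4.
ms_h, ms_e, f_ratio = anova_f(0.073560, 5, 0.319988, 42)
print(round(ms_h, 6), round(ms_e, 6), round(f_ratio, 2))
```

The degrees of freedom also agree with the balanced one-way layout: $k - 1 = 5$ for six rootstocks and $k(n - 1) = 42$ with $n = 8$.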


6.5  TWO-WAY  CLASSIFICATION 

We consider only balanced models, where each cell in the model has the same number
of observations, $n$. For the unbalanced case with unequal cell sizes, see Rencher
(1998, Section 4.8).


6.5.1 Review of Univariate Two-Way ANOVA

In the univariate two-way model, we measure one dependent variable $y$ on each
experimental unit. The balanced two-way fixed-effects model with factors A and
B is

\[ y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \varepsilon_{ijk} \tag{6.65} \]
\[ \phantom{y_{ijk}} = \mu_{ij} + \varepsilon_{ijk}, \tag{6.66} \]
\[ i = 1, 2, \ldots, a, \quad j = 1, 2, \ldots, b, \quad k = 1, 2, \ldots, n, \]

where $\alpha_i$ is the effect (on $y_{ijk}$) of the $i$th level of A, $\beta_j$ is the effect of the $j$th level of
B, $\gamma_{ij}$ is the corresponding interaction effect, and $\mu_{ij}$ is the population mean for the
$i$th level of A and the $j$th level of B. In order to obtain F-tests, we further assume
that the $\varepsilon_{ijk}$'s are independently distributed as $N(0, \sigma^2)$.

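Let $\bar{\mu}_{i.} = \sum_j \mu_{ij}/b$ be the mean at the $i$th level of A and define $\bar{\mu}_{.j}$ and $\bar{\mu}_{..}$
similarly. Then if we use side conditions $\sum_i \alpha_i = \sum_j \beta_j = \sum_i \gamma_{ij} = \sum_j \gamma_{ij} = 0$,
the effect of the $i$th level of A can be defined as $\alpha_i = \bar{\mu}_{i.} - \bar{\mu}_{..}$, with similar definitions
of $\beta_j$ and $\gamma_{ij}$. We can show that $\sum_i \alpha_i = 0$ if $\alpha_i = \bar{\mu}_{i.} - \bar{\mu}_{..}$ as follows:

\[ \sum_{i=1}^{a} \alpha_i = \sum_{i=1}^{a} (\bar{\mu}_{i.} - \bar{\mu}_{..}) = \sum_{i} \bar{\mu}_{i.} - a\bar{\mu}_{..}
   = a\bar{\mu}_{..} - a\bar{\mu}_{..} = 0. \]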

Many texts recommend that the interaction AB be tested first, and that if it is
found to be significant, then the main effects should not be tested. However, with the
side conditions imposed earlier (side conditions are not necessary in order to obtain
tests), the effect of A is defined as the average effect over the levels of B, and the
effect of B is defined similarly. With this definition of main effects, the tests for A
and B make sense even if AB is significant. Admittedly, interpretation requires more
care, and the effect of a factor may vary if the number of levels of the other factor is
altered. But in many cases useful information can be gained about the main effects
in the presence of interaction.

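We illustrate the preceding statement that $\alpha_i = \bar{\mu}_{i.} - \bar{\mu}_{..}$ represents the effect of
the $i$th level of A averaged over the levels of B. Suppose A has two levels and B has
three. We represent the means of the six cells as follows:

                      B
  A        1              2              3              Mean
  1        $\mu_{11}$     $\mu_{12}$     $\mu_{13}$     $\bar{\mu}_{1.}$
  2        $\mu_{21}$     $\mu_{22}$     $\mu_{23}$     $\bar{\mu}_{2.}$
  Mean     $\bar{\mu}_{.1}$   $\bar{\mu}_{.2}$   $\bar{\mu}_{.3}$   $\bar{\mu}_{..}$

The means of the rows (corresponding to levels of A) and columns (levels of B)
are also given. Then $\alpha_i = \bar{\mu}_{i.} - \bar{\mu}_{..}$ can be expressed as the average of the effect of
the $i$th level of A at the three levels of B. For example,

\[ \alpha_1 = \tfrac{1}{3}[(\mu_{11} - \bar{\mu}_{.1}) + (\mu_{12} - \bar{\mu}_{.2}) + (\mu_{13} - \bar{\mu}_{.3})] \]
\[ \phantom{\alpha_1} = \tfrac{1}{3}(\mu_{11} + \mu_{12} + \mu_{13}) - \tfrac{1}{3}(\bar{\mu}_{.1} + \bar{\mu}_{.2} + \bar{\mu}_{.3}) = \bar{\mu}_{1.} - \bar{\mu}_{..}. \]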
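The identity for $\alpha_1$ can be verified with any set of cell means. The sketch below assumes a made-up $2 \times 3$ table of population cell means; the variable names are illustrative only.

```python
# Hypothetical cell means mu[i][j] for a = 2 levels of A and b = 3 levels of B.
mu = [[10.0, 12.0, 14.0],
      [6.0, 10.0, 8.0]]
a, b = len(mu), len(mu[0])

row_means = [sum(row) / b for row in mu]                                 # mu-bar_i.
col_means = [sum(mu[i][j] for i in range(a)) / a for j in range(b)]      # mu-bar_.j
grand_mean = sum(row_means) / a                                          # mu-bar_..

# Effect of level i of A defined through the side conditions.
alpha = [m - grand_mean for m in row_means]

# The same effect written as an average, over the levels of B, of the
# deviation of each cell mean from its column mean.
alpha_averaged = [sum(mu[i][j] - col_means[j] for j in range(b)) / b
                  for i in range(a)]

print(alpha, alpha_averaged, sum(alpha))
```

Both computations give $\alpha_1 = 2$ and $\alpha_2 = -2$ for these illustrative means, and the effects sum to zero as the side condition requires.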

To estimate $\alpha_i$, we can use $\hat{\alpha}_i = \bar{y}_{i..} - \bar{y}_{...}$, with similar estimates for $\beta_j$ and $\gamma_{ij}$.
The notation $\bar{y}_{i..}$ indicates that $y_{ijk}$ is averaged over the levels of $j$ and $k$ to obtain
the mean of all $nb$ observations at the $i$th level of A, namely, $\bar{y}_{i..} = \sum_{jk} y_{ijk}/nb$.
The means $\bar{y}_{.j.}$, $\bar{y}_{ij.}$, and $\bar{y}_{...}$ have analogous definitions.

To construct tests for the significance of factors A and B and the interaction AB,
we use the usual sums of squares and degrees of freedom as shown in Table 6.4.
Computational forms of the sums of squares can be found in many standard (univariate)
methods texts.




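Table 6.4. Univariate Two-Way Analysis of Variance

Source    Sum of Squares                                                                              df
A         $\mathrm{SSA} = nb\sum_i (\bar{y}_{i..} - \bar{y}_{...})^2$                                 $a - 1$
B         $\mathrm{SSB} = na\sum_j (\bar{y}_{.j.} - \bar{y}_{...})^2$                                 $b - 1$
AB        $\mathrm{SSAB} = n\sum_{ij} (\bar{y}_{ij.} - \bar{y}_{i..} - \bar{y}_{.j.} + \bar{y}_{...})^2$   $(a - 1)(b - 1)$
Error     $\mathrm{SSE} = \sum_{ijk} (y_{ijk} - \bar{y}_{ij.})^2$                                     $ab(n - 1)$
Total     $\mathrm{SST} = \sum_{ijk} (y_{ijk} - \bar{y}_{...})^2$                                     $abn - 1$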

The sums of squares in Table 6.4 (for the balanced model) have the relationship

\[ \mathrm{SST} = \mathrm{SSA} + \mathrm{SSB} + \mathrm{SSAB} + \mathrm{SSE}, \]

and the four sums of squares on the right are independent. The sums of squares
are divided by their corresponding degrees of freedom to obtain mean squares MSA,
MSB, MSAB, and MSE. For the fixed effects model, each of MSA, MSB, and MSAB
is divided by MSE to obtain an F-test. In the case of factor A, for example, the
hypothesis can be expressed as

\[ H_{0A}\colon \alpha_1 = \alpha_2 = \cdots = \alpha_a = 0, \]

and the test statistic is $F = \mathrm{MSA}/\mathrm{MSE}$, which is distributed as $F_{a-1,\,ab(n-1)}$.
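The sums of squares in Table 6.4 and the partition of SST can be computed directly from their definitions. The sketch below assumes a small made-up balanced data set ($a = 2$, $b = 3$, $n = 2$); the function name `two_way_anova` and the dictionary keys are illustrative.

```python
def two_way_anova(data):
    """Sums of squares of Table 6.4 for balanced data, data[i][j] = list of n obs."""
    a, b, n = len(data), len(data[0]), len(data[0][0])
    grand = sum(y for row in data for cell in row for y in cell) / (a * b * n)
    row_means = [sum(y for cell in row for y in cell) / (b * n) for row in data]
    col_means = [sum(y for row in data for y in row[j]) / (a * n) for j in range(b)]
    cell_means = [[sum(cell) / n for cell in row] for row in data]

    ssa = n * b * sum((m - grand) ** 2 for m in row_means)
    ssb = n * a * sum((m - grand) ** 2 for m in col_means)
    ssab = n * sum((cell_means[i][j] - row_means[i] - col_means[j] + grand) ** 2
                   for i in range(a) for j in range(b))
    sse = sum((y - cell_means[i][j]) ** 2
              for i in range(a) for j in range(b) for y in data[i][j])
    sst = sum((y - grand) ** 2 for row in data for cell in row for y in cell)

    mse = sse / (a * b * (n - 1))
    return {
        "SSA": ssa, "SSB": ssb, "SSAB": ssab, "SSE": sse, "SST": sst,
        "F_A": (ssa / (a - 1)) / mse,
        "F_B": (ssb / (b - 1)) / mse,
        "F_AB": (ssab / ((a - 1) * (b - 1))) / mse,
    }


# Hypothetical balanced data: a = 2 levels of A, b = 3 levels of B, n = 2 per cell.
data = [[[9.0, 11.0], [11.0, 13.0], [13.0, 15.0]],
        [[5.0, 7.0], [9.0, 11.0], [7.0, 9.0]]]
result = two_way_anova(data)
print(result)
```

For these illustrative data the components are SSA = 48, SSB = 24, SSAB = 8, and SSE = 12, which add to SST = 92, and the F-ratios for A, B, and AB are 24, 6, and 2.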

In order to define contrasts among the levels of each main effect, we can conveniently
use the model in the form given in (6.66),

\[ y_{ijk} = \mu_{ij} + \varepsilon_{ijk}. \]

A contrast among the levels of A is defined as $\sum_{i=1}^{a} c_i\bar{\mu}_{i.}$, where $\sum_i c_i = 0$. An estimate
of the contrast is given by $\sum_i c_i\bar{y}_{i..}$, with variance $\sigma^2\sum_i c_i^2/nb$, since each $\bar{y}_{i..}$
is based on $nb$ observations and the $\bar{y}_{i..}$'s are independent. To test $H_0\colon \sum_i c_i\bar{\mu}_{i.} = 0$,
we can use an F-statistic corresponding to (6.58),

\[ F = \frac{nb\left(\sum_{i=1}^{a} c_i\bar{y}_{i..}\right)^2 / \sum_{i=1}^{a} c_i^2}{\mathrm{MSE}}, \tag{6.67} \]

with 1 and $\nu_E$ degrees of freedom. To preserve the experimentwise error rate, significance
tests for more than one contrast could be carried out in the spirit of Section 6.4;
that is, contrasts should be chosen prior to seeing the data, and tests should be made
only if the overall F-test for factor A rejects $H_{0A}$.

Contrasts $\sum_j c_j\bar{\mu}_{.j}$ among the levels of B are tested in an entirely analogous
manner.


6.5.2  Multivariate  Two-Way  MANOVA 

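A balanced two-way fixed-effects MANOVA model for $p$ dependent variables can
be expressed in vector form analogous to (6.65) and (6.66):

\[ \mathbf{y}_{ijk} = \boldsymbol{\mu} + \boldsymbol{\alpha}_i + \boldsymbol{\beta}_j + \boldsymbol{\gamma}_{ij} + \boldsymbol{\varepsilon}_{ijk} = \boldsymbol{\mu}_{ij} + \boldsymbol{\varepsilon}_{ijk}, \tag{6.68} \]
\[ i = 1, 2, \ldots, a, \quad j = 1, 2, \ldots, b, \quad k = 1, 2, \ldots, n, \]

where $\boldsymbol{\alpha}_i$ is the effect of the $i$th level of A on each of the $p$ variables in $\mathbf{y}_{ijk}$, $\boldsymbol{\beta}_j$
is the effect of the $j$th level of B, and $\boldsymbol{\gamma}_{ij}$ is the AB interaction effect. We use side
conditions $\sum_i \boldsymbol{\alpha}_i = \sum_j \boldsymbol{\beta}_j = \sum_i \boldsymbol{\gamma}_{ij} = \sum_j \boldsymbol{\gamma}_{ij} = \mathbf{0}$ and assume the $\boldsymbol{\varepsilon}_{ijk}$'s are
independently distributed as $N_p(\mathbf{0}, \boldsymbol{\Sigma})$. Under the side condition $\sum_i \boldsymbol{\alpha}_i = \mathbf{0}$, the
effect of A is averaged over the levels of B; that is, $\boldsymbol{\alpha}_i = \bar{\boldsymbol{\mu}}_{i.} - \bar{\boldsymbol{\mu}}_{..}$, where
$\bar{\boldsymbol{\mu}}_{i.} = \sum_j \boldsymbol{\mu}_{ij}/b$ and $\bar{\boldsymbol{\mu}}_{..} = \sum_{ij} \boldsymbol{\mu}_{ij}/ab$. There are similar definitions for $\boldsymbol{\beta}_j$ and $\boldsymbol{\gamma}_{ij}$.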

As  in  the  univariate  usage,  the  mean  vector  y ,•  indicates  an  average  over  the  sub¬ 
scripts  replaced  by  dots,  that  is,  y,-  =  12  jk  Yijk/nb.  The  meansy  ,  y^- ,  andy  have 
analogous  definitions:  y  ^  =  J^ik  hjklna,  ytj.  =  12k  yijk/n,y...  =  Y.ijk  yijk/n»h- 
The  sum  of  squares  and  products  matrices  are  given  in  Table  6.5.  Note  that  the 
degrees  of  freedom  in  Table  6.5  are  the  same  as  in  the  univariate  case  in  Table  6.4. 
For  the  two-way  model  with  balanced  data,  the  total  sum  of  squares  and  products 
matrix  is  partitioned  as 


$$\mathbf{T} = \mathbf{H}_A + \mathbf{H}_B + \mathbf{H}_{AB} + \mathbf{E}. \qquad (6.69)$$


The structure of any of the hypothesis matrices is similar to that of H in (6.11).
For example, $\mathbf{H}_A$ has on the diagonal the sum of squares for factor A for each of the
p variables. The off-diagonal elements of $\mathbf{H}_A$ are corresponding sums of products
for all pairs of variables. Thus the rth diagonal element of $\mathbf{H}_A$, corresponding to the
rth variable, r = 1, 2, ..., p, is given by

$$h_{Arr} = nb \sum_{i=1}^{a} (\bar{y}_{i..r} - \bar{y}_{...r})^2 = \sum_{i=1}^{a} \frac{y_{i..r}^2}{nb} - \frac{y_{...r}^2}{nab}, \qquad (6.70)$$

where $\bar{y}_{i..r}$ and $\bar{y}_{...r}$ represent the rth components of $\bar{\mathbf{y}}_{i..}$ and $\bar{\mathbf{y}}_{...}$, respectively, and
$y_{i..r}$ and $y_{...r}$ are totals corresponding to $\bar{y}_{i..r}$ and $\bar{y}_{...r}$. The (rs)th off-diagonal
element of $\mathbf{H}_A$ is

$$h_{Ars} = nb \sum_{i=1}^{a} (\bar{y}_{i..r} - \bar{y}_{...r})(\bar{y}_{i..s} - \bar{y}_{...s}) = \sum_{i=1}^{a} \frac{y_{i..r}\, y_{i..s}}{nb} - \frac{y_{...r}\, y_{...s}}{nab}. \qquad (6.71)$$


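Table 6.5. Multivariate Two-Way Analysis of Variance

Source   Sum of Squares and Products Matrix                                                                  df
A        $\mathbf{H}_A = nb \sum_i (\bar{\mathbf{y}}_{i..} - \bar{\mathbf{y}}_{...})(\bar{\mathbf{y}}_{i..} - \bar{\mathbf{y}}_{...})'$          a - 1
B        $\mathbf{H}_B = na \sum_j (\bar{\mathbf{y}}_{.j.} - \bar{\mathbf{y}}_{...})(\bar{\mathbf{y}}_{.j.} - \bar{\mathbf{y}}_{...})'$          b - 1
AB       $\mathbf{H}_{AB} = n \sum_{ij} (\bar{\mathbf{y}}_{ij.} - \bar{\mathbf{y}}_{i..} - \bar{\mathbf{y}}_{.j.} + \bar{\mathbf{y}}_{...})(\bar{\mathbf{y}}_{ij.} - \bar{\mathbf{y}}_{i..} - \bar{\mathbf{y}}_{.j.} + \bar{\mathbf{y}}_{...})'$          (a - 1)(b - 1)
Error    $\mathbf{E} = \sum_{ijk} (\mathbf{y}_{ijk} - \bar{\mathbf{y}}_{ij.})(\mathbf{y}_{ijk} - \bar{\mathbf{y}}_{ij.})'$          ab(n - 1)
Total    $\mathbf{T} = \sum_{ijk} (\mathbf{y}_{ijk} - \bar{\mathbf{y}}_{...})(\mathbf{y}_{ijk} - \bar{\mathbf{y}}_{...})'$          abn - 1

From (6.69) and Table 6.5, we obtain

$$h_{ABrr} = \sum_{ij} \frac{y_{ij.r}^2}{n} - \frac{y_{...r}^2}{nab} - h_{Arr} - h_{Brr},$$

$$h_{ABrs} = \sum_{ij} \frac{y_{ij.r}\, y_{ij.s}}{n} - \frac{y_{...r}\, y_{...s}}{nab} - h_{Ars} - h_{Brs}. \qquad (6.72)$$

For the E matrix, computational formulas are based on (6.69):

$$\mathbf{E} = \mathbf{T} - \mathbf{H}_A - \mathbf{H}_B - \mathbf{H}_{AB}.$$

Thus the elements of E have the form

$$e_{rr} = \sum_{ijk} y_{ijkr}^2 - \frac{y_{...r}^2}{nab} - h_{Arr} - h_{Brr} - h_{ABrr},$$

$$e_{rs} = \sum_{ijk} y_{ijkr}\, y_{ijks} - \frac{y_{...r}\, y_{...s}}{nab} - h_{Ars} - h_{Brs} - h_{ABrs}. \qquad (6.73)$$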
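
The definitional forms in Table 6.5 and the partition (6.69) can be checked numerically. The following is a minimal sketch, not a definitive implementation: the function name `two_way_sscp`, the array layout (a, b, n, p), and the random data are assumptions introduced here for illustration.

```python
import numpy as np

def two_way_sscp(Y):
    """SSCP matrices for a balanced two-way layout.

    Y has shape (a, b, n, p): levels of A, levels of B, replications, variables.
    """
    a, b, n, p = Y.shape
    grand = Y.mean(axis=(0, 1, 2))        # ybar_...
    mean_a = Y.mean(axis=(1, 2))          # ybar_i.., shape (a, p)
    mean_b = Y.mean(axis=(0, 2))          # ybar_.j., shape (b, p)
    mean_ab = Y.mean(axis=2)              # ybar_ij., shape (a, b, p)

    da = mean_a - grand
    db = mean_b - grand
    dab = mean_ab - mean_a[:, None, :] - mean_b[None, :, :] + grand
    resid = Y - mean_ab[:, :, None, :]
    tot = Y - grand

    H_A = n * b * da.T @ da
    H_B = n * a * db.T @ db
    H_AB = n * np.einsum("ijr,ijs->rs", dab, dab)
    E = np.einsum("ijkr,ijks->rs", resid, resid)
    T = np.einsum("ijkr,ijks->rs", tot, tot)
    return H_A, H_B, H_AB, E, T

# Hypothetical data: a = 2, b = 3, n = 4, p = 2.
rng = np.random.default_rng(0)
Y = rng.normal(size=(2, 3, 4, 2))
H_A, H_B, H_AB, E, T = two_way_sscp(Y)
partition_ok = np.allclose(T, H_A + H_B + H_AB + E)
print(partition_ok)
```

Computing E directly from the residuals, rather than by subtraction, makes the check of (6.69) meaningful: the identity holds only because the data are balanced.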


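The hypothesis matrices for interaction and main effects in this fixed-effects
model can be compared to E to make a test. Thus for Wilks' $\Lambda$, we use E to test
each of A, B, and AB:

$$\Lambda_A = \frac{|\mathbf{E}|}{|\mathbf{E} + \mathbf{H}_A|} \quad \text{is } \Lambda_{p,\,a-1,\,ab(n-1)},$$

$$\Lambda_B = \frac{|\mathbf{E}|}{|\mathbf{E} + \mathbf{H}_B|} \quad \text{is } \Lambda_{p,\,b-1,\,ab(n-1)},$$

$$\Lambda_{AB} = \frac{|\mathbf{E}|}{|\mathbf{E} + \mathbf{H}_{AB}|} \quad \text{is } \Lambda_{p,\,(a-1)(b-1),\,ab(n-1)}.$$

In each case, the indicated distribution holds when $H_0$ is true. To calculate the other
three MANOVA test statistics for A, B, and AB, we use the eigenvalues of $\mathbf{E}^{-1}\mathbf{H}_A$,
$\mathbf{E}^{-1}\mathbf{H}_B$, and $\mathbf{E}^{-1}\mathbf{H}_{AB}$.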

If  the  interaction  is  not  significant,  interpretation  of  the  main  effects  is  simpler. 
However,  the  comments  in  Section  6.5.1  about  testing  main  effects  in  the  presence 
of  interaction  apply  to  the  multivariate  model  as  well.  If  we  define  each  main  effect 
as  the  average  effect  over  the  levels  of  the  other  factor,  then  main  effects  can  be  tested 
even  if  the  interaction  is  significant.  One  must  be  more  careful  with  the  interpretation 
in  case  of  a  significant  interaction,  but  there  is  information  to  be  gained. 

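By analogy with the univariate two-way ANOVA in Section 6.5.1, a contrast
among the levels of factor A can be defined in terms of the mean vectors as follows:
$\sum_{i=1}^{a} c_i \bar{\boldsymbol{\mu}}_{i.}$, where $\sum_i c_i = 0$ and $\bar{\boldsymbol{\mu}}_{i.} = \sum_j \boldsymbol{\mu}_{ij}/b$. Similarly, $\sum_{j=1}^{b} c_j \bar{\boldsymbol{\mu}}_{.j}$
represents a contrast among the levels of B. The hypothesis that these contrasts are
0 can be tested by $T^2$ or any of the four MANOVA test statistics, as in (6.62), (6.63),
and (6.64). To test $H_0: \sum_i c_i \bar{\boldsymbol{\mu}}_{i.} = \mathbf{0}$, for example, we can use

$$T^2 = \frac{nb}{\sum_{i=1}^{a} c_i^2} \left( \sum_{i=1}^{a} c_i \bar{\mathbf{y}}_{i..} \right)' \left( \frac{\mathbf{E}}{\nu_E} \right)^{-1} \left( \sum_{i=1}^{a} c_i \bar{\mathbf{y}}_{i..} \right), \qquad (6.74)$$

which is distributed as $T^2_{p,\nu_E}$ when $H_0$ is true. Alternatively, the hypothesis matrix

$$\mathbf{H}_1 = \frac{nb}{\sum_{i=1}^{a} c_i^2} \left( \sum_{i=1}^{a} c_i \bar{\mathbf{y}}_{i..} \right) \left( \sum_{i=1}^{a} c_i \bar{\mathbf{y}}_{i..} \right)' \qquad (6.75)$$

can be used in

$$\Lambda = \frac{|\mathbf{E}|}{|\mathbf{E} + \mathbf{H}_1|},$$

which, under $H_0$, is $\Lambda_{p,1,\nu_E}$, with $\nu_E = ab(n - 1)$ in the two-way model. The other
three MANOVA test statistics can also be constructed from $\mathbf{E}^{-1}\mathbf{H}_1$. All five test
statistics will give equivalent results because $\nu_H = 1$.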
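
The equivalence of the tests for a single contrast can be illustrated numerically. The sketch below assumes the two-way analogue of the one-way contrast statistics, with each $\bar{\mathbf{y}}_{i..}$ based on nb observations; the function name `contrast_tests`, the data layout (a, b, n, p), and the random data are hypothetical.

```python
import numpy as np

def contrast_tests(Y, c):
    """T^2 and Wilks' Lambda for a contrast among the levels of A.

    Y has shape (a, b, n, p); c holds the a contrast coefficients.
    """
    a, b, n, p = Y.shape
    c = np.asarray(c, dtype=float)
    if abs(c.sum()) > 1e-12:
        raise ValueError("contrast coefficients must sum to zero")

    mean_a = Y.mean(axis=(1, 2))                       # ybar_i..
    resid = (Y - Y.mean(axis=2, keepdims=True)).reshape(-1, p)
    E = resid.T @ resid                                # error matrix
    nu_E = a * b * (n - 1)

    d = c @ mean_a                                     # sum_i c_i ybar_i..
    scale = n * b / np.sum(c ** 2)
    T2 = scale * d @ np.linalg.solve(E / nu_E, d)      # form assumed for (6.74)
    H1 = scale * np.outer(d, d)                        # form assumed for (6.75)
    wilks = np.linalg.det(E) / np.linalg.det(E + H1)
    return T2, wilks, nu_E

# Hypothetical data: a = 3, b = 2, n = 4, p = 2.
rng = np.random.default_rng(1)
Y = rng.normal(size=(3, 2, 4, 2))
T2, wilks, nu_E = contrast_tests(Y, [1.0, -0.5, -0.5])
print(T2, nu_E * (1 - wilks) / wilks)   # the two numbers should agree
```

Because $\mathbf{H}_1$ has rank 1, $\mathbf{E}^{-1}\mathbf{H}_1$ has a single nonzero eigenvalue $\lambda_1$, so that $\Lambda = 1/(1 + \lambda_1)$ and $T^2 = \nu_E \lambda_1$; this is why the printed values coincide.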

If follow-up tests on individual variables are desired, we can infer from Rencher
and Scott (1990), as reported in Section 6.4, that if the MANOVA test on factor A
or B leads to rejection of $H_0$, then we can proceed with the univariate F-tests on the
individual variables with assurance that the experimentwise error rate will be close
to $\alpha$.

To determine the contribution of each variable in the presence of the others, we
can examine the first discriminant function obtained from eigenvectors of $\mathbf{E}^{-1}\mathbf{H}_A$
or $\mathbf{E}^{-1}\mathbf{H}_B$, as in Section 6.4 for one-way MANOVA. The first discriminant function
for $\mathbf{E}^{-1}\mathbf{H}_A$, for example, is $z = \mathbf{a}'\mathbf{y}$, where $\mathbf{a}$ is the eigenvector associated with the
largest eigenvalue of $\mathbf{E}^{-1}\mathbf{H}_A$. In $z = a_1 y_1 + a_2 y_2 + \cdots + a_p y_p$, if $a_r$ is larger than
the other a's, then $y_r$ contributes more than the other variables to the significance of
$\Lambda_A$. (In many cases, the $a_r$'s should be standardized as in Section 8.5.) Note that the
first discriminant function obtained from $\mathbf{E}^{-1}\mathbf{H}_A$ will not have the same pattern as
the first discriminant function from $\mathbf{E}^{-1}\mathbf{H}_B$. This is not surprising since we expect
that the relative contribution of the variables to separating the levels of factor A will
be different from the relative contribution to separating the levels of B.
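
As a rough illustration of extracting the first discriminant function, the sketch below uses the E and $\mathbf{H}_A$ matrices reported in Example 6.5.2. The helper name `first_discriminant` and the unit-length scaling of the eigenvector are choices made here, not part of the text; the coefficients are raw (unstandardized).

```python
import numpy as np

def first_discriminant(E, H):
    """Largest eigenvalue of E^{-1}H and its eigenvector, scaled to unit length."""
    vals, vecs = np.linalg.eig(np.linalg.solve(E, H))
    k = int(np.argmax(vals.real))
    a = vecs[:, k].real
    return float(vals[k].real), a / np.linalg.norm(a)

# E and H_A as reported (rounded) in Example 6.5.2.
E = np.array([[4.897, -1.890], [-1.890, 736.390]])
H_A = np.array([[1.205, 27.208], [27.208, 614.251]])

lam1, a_vec = first_discriminant(E, H_A)
print(round(lam1, 3))   # close to the eigenvalue 1.110 quoted in the example
print(a_vec)            # raw coefficients; the overall sign is arbitrary
```

The raw coefficient on $y_1$ is much larger than that on $y_2$, but the two variables are measured on very different scales, which is the situation in which the text recommends standardizing the coefficients before comparing them.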

A randomized block design or a two-way MANOVA without replication can easily
be analyzed in a manner similar to that for the two-way model with replication
given here; therefore, no specific details will be given.

Example 6.5.2. Table 6.6 contains data reported by Posten (1962) and analyzed by
Kramer and Jensen (1970). The experiment involved a 2 x 4 design with 4 replications,
for a total of 32 observation vectors. The factors were rotational velocity
[$A_1$ (fast) and $A_2$ (slow)] and lubricants (four types). The experimental units were
32 homogeneous pieces of bar steel. Two variables were measured on each piece of
bar steel:

$y_1$ = ultimate torque,
$y_2$ = ultimate strain.

Table 6.6. Two-Way Classification of Measurements on Bar Steel

                      $A_1$                 $A_2$
Lubricant       $y_1$     $y_2$       $y_1$     $y_2$
$B_1$           7.80     90.4       7.12     85.1
                7.10     88.9       7.06     89.0
                7.89     85.9       7.45     75.9
                7.82     88.8       7.45     77.9
$B_2$           9.00     82.5       8.19     66.0
                8.43     92.4       8.25     74.5
                7.65     82.4       7.45     83.1
                7.70     87.4       7.45     86.4
$B_3$           7.28     79.6       7.15     81.2
                8.96     95.1       7.15     72.0
                7.75     90.2       7.70     79.9
                7.80     88.0       7.45     71.9
$B_4$           7.60     94.1       7.06     81.2
                7.00     86.6       7.04     79.9
                7.82     85.9       7.52     86.4
                7.80     88.8       7.70     76.4

We display the totals for each variable for use in computations. The interior entries
are cell totals (over the four replications), and the marginal totals are for each
level of A and B:

Totals for $y_1$
            $A_1$       $A_2$       Total
$B_1$       30.61      29.08       59.69
$B_2$       32.78      31.34       64.12
$B_3$       31.79      29.45       61.24
$B_4$       30.22      29.32       59.54
Total      125.40     119.19      244.59

Totals for $y_2$
            $A_1$       $A_2$       Total
$B_1$       354.0      327.9       681.9
$B_2$       344.7      310.0       654.7
$B_3$       352.9      305.0       657.9
$B_4$       355.4      323.9       679.3
Total      1407.0     1266.8      2673.8

Using computational forms for $h_{Arr}$ in (6.70), the (1, 1) element of $\mathbf{H}_A$ (corresponding
to $y_1$) is given by

$$h_{A11} = \frac{(125.40)^2 + (119.19)^2}{(4)(4)} - \frac{(244.59)^2}{(4)(4)(2)} = 1.205.$$

For the (2, 2) element of $\mathbf{H}_A$ (corresponding to $y_2$), we have

$$h_{A22} = \frac{(1407.0)^2 + (1266.8)^2}{16} - \frac{(2673.8)^2}{32} = 614.25.$$

For the (1, 2) element of $\mathbf{H}_A$ (corresponding to $y_1 y_2$), we use (6.71) for $h_{Ars}$ to
obtain

$$h_{A12} = \frac{(125.40)(1407.0) + (119.19)(1266.8)}{16} - \frac{(244.59)(2673.8)}{32} = 27.208.$$

Thus

$$\mathbf{H}_A = \begin{pmatrix} 1.205 & 27.208 \\ 27.208 & 614.251 \end{pmatrix}.$$

We obtain $\mathbf{H}_B$ similarly:

$$h_{B11} = \frac{(59.69)^2 + \cdots + (59.54)^2}{(4)(2)} - \frac{(244.59)^2}{32} = 1.694,$$

$$h_{B22} = \frac{(681.9)^2 + \cdots + (679.3)^2}{8} - \frac{(2673.8)^2}{32} = 74.874,$$

$$h_{B12} = \frac{(59.69)(681.9) + \cdots + (59.54)(679.3)}{8} - \frac{(244.59)(2673.8)}{32} = -9.862,$$

$$\mathbf{H}_B = \begin{pmatrix} 1.694 & -9.862 \\ -9.862 & 74.874 \end{pmatrix}.$$

For $\mathbf{H}_{AB}$ we have, by (6.72),

$$h_{AB11} = \frac{(30.61)^2 + \cdots + (29.32)^2}{4} - \frac{(244.59)^2}{32} - 1.205 - 1.694 = .132,$$

$$h_{AB22} = \frac{(354.0)^2 + \cdots + (323.9)^2}{4} - \frac{(2673.8)^2}{32} - 614.25 - 74.874 = 32.244,$$

$$h_{AB12} = \frac{(30.61)(354.0) + \cdots + (29.32)(323.9)}{4} - \frac{(244.59)(2673.8)}{32} - 27.208 - (-9.862) = 1.585,$$

$$\mathbf{H}_{AB} = \begin{pmatrix} .132 & 1.585 \\ 1.585 & 32.244 \end{pmatrix}.$$

The error matrix E is obtained using the computational forms given for $e_{rr}$ and $e_{rs}$
in (6.73). For example, $e_{11}$ and $e_{12}$ are computed as

$$e_{11} = (7.80)^2 + (7.10)^2 + \cdots + (7.70)^2 - \frac{(244.59)^2}{32} - 1.205 - 1.694 - .132 = 4.897,$$

$$e_{12} = (7.80)(90.4) + \cdots + (7.70)(76.4) - \frac{(244.59)(2673.8)}{32} - 27.208 - (-9.862) - 1.585 = -1.890.$$

Proceeding in this fashion, we obtain

$$\mathbf{E} = \begin{pmatrix} 4.897 & -1.890 \\ -1.890 & 736.390 \end{pmatrix},$$

with $\nu_E = ab(n - 1) = (2)(4)(4 - 1) = 24$.

To test the main effect of A with Wilks' $\Lambda$, we compute

$$\Lambda_A = \frac{|\mathbf{E}|}{|\mathbf{E} + \mathbf{H}_A|} = \frac{3602.2}{7600.2} = .474 < \Lambda_{.05,2,1,24} = .771,$$

and we conclude that velocity has a significant effect on $y_1$ or $y_2$ or both.
For the B main effect, we have

$$\Lambda_B = \frac{|\mathbf{E}|}{|\mathbf{E} + \mathbf{H}_B|} = \frac{3602.2}{5208.6} = .6916 > \Lambda_{.05,2,3,24} = .591.$$

We conclude that the effect of lubricants is not significant.
For the AB interaction, we obtain

$$\Lambda_{AB} = \frac{|\mathbf{E}|}{|\mathbf{E} + \mathbf{H}_{AB}|} = \frac{3602.2}{3865.3} = .932 > \Lambda_{.05,2,3,24} = .591.$$

Hence we conclude that the interaction effect is not significant.
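
The computations in this example can be reproduced from the raw data of Table 6.6. The following sketch uses the definitional forms of Table 6.5 rather than the computational formulas based on totals; the variable names and the array layout (a, b, n, p) are choices made here for illustration.

```python
import numpy as np

# Table 6.6: cells[i][j] lists the (y1, y2) pairs for velocity A_(i+1), lubricant B_(j+1).
cells = [
    [  # A1 (fast)
        [(7.80, 90.4), (7.10, 88.9), (7.89, 85.9), (7.82, 88.8)],
        [(9.00, 82.5), (8.43, 92.4), (7.65, 82.4), (7.70, 87.4)],
        [(7.28, 79.6), (8.96, 95.1), (7.75, 90.2), (7.80, 88.0)],
        [(7.60, 94.1), (7.00, 86.6), (7.82, 85.9), (7.80, 88.8)],
    ],
    [  # A2 (slow)
        [(7.12, 85.1), (7.06, 89.0), (7.45, 75.9), (7.45, 77.9)],
        [(8.19, 66.0), (8.25, 74.5), (7.45, 83.1), (7.45, 86.4)],
        [(7.15, 81.2), (7.15, 72.0), (7.70, 79.9), (7.45, 71.9)],
        [(7.06, 81.2), (7.04, 79.9), (7.52, 86.4), (7.70, 76.4)],
    ],
]
Y = np.array(cells)                       # shape (a, b, n, p) = (2, 4, 4, 2)
a, b, n, p = Y.shape

grand = Y.mean(axis=(0, 1, 2))
mean_a = Y.mean(axis=(1, 2))
mean_b = Y.mean(axis=(0, 2))
mean_ab = Y.mean(axis=2)

da = mean_a - grand
db = mean_b - grand
dab = mean_ab - mean_a[:, None, :] - mean_b[None, :, :] + grand
resid = (Y - mean_ab[:, :, None, :]).reshape(-1, p)

H_A = n * b * da.T @ da
H_B = n * a * db.T @ db
H_AB = n * np.einsum("ijr,ijs->rs", dab, dab)
E = resid.T @ resid

def wilks(E, H):
    """Wilks' Lambda = |E| / |E + H|."""
    return np.linalg.det(E) / np.linalg.det(E + H)

lam_A, lam_B, lam_AB = wilks(E, H_A), wilks(E, H_B), wilks(E, H_AB)
print(np.round(H_A, 3), np.round(E, 3), sep="\n")
print(round(lam_A, 3), round(lam_B, 3), round(lam_AB, 3))
```

The printed matrices and the three values of $\Lambda$ should agree with those in the example up to rounding.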

We now obtain the other three MANOVA test statistics for each test. For A, the
only nonzero eigenvalue of $\mathbf{E}^{-1}\mathbf{H}_A$ is 1.110. Thus

$$V^{(s)} = \frac{\lambda_1}{1 + \lambda_1} = .526, \qquad U^{(s)} = \lambda_1 = 1.110, \qquad \theta = \frac{\lambda_1}{1 + \lambda_1} = .526.$$

In this case, all three tests give results equivalent to that of $\Lambda_A$ because $\nu_H = s = 1$.

For B, $\nu_H = 3$ and $p = s = 2$. The eigenvalues of $\mathbf{E}^{-1}\mathbf{H}_B$ are .418 and .020, and
we obtain

$$V^{(s)} = \sum_{i=1}^{s} \frac{\lambda_i}{1 + \lambda_i} = .314, \qquad U^{(s)} = \sum_{i=1}^{s} \lambda_i = .438, \qquad \frac{\nu_E}{\nu_H} U^{(s)} = 3.502,$$

$$\theta = \frac{\lambda_1}{1 + \lambda_1} = .295.$$

With s = 2, m = 0, and N = 10.5, we have $V^{(s)}_{.05} = .439$ and $\theta_{.05} = .364$. The .05
critical value of $\nu_E U^{(s)}/\nu_H$ is 5.1799. Thus $V^{(s)}$, $U^{(s)}$, and $\theta$ lead to acceptance of
$H_0$, as does $\Lambda$. Of the four tests, $\theta$ appears to be closer to rejection. This is because
$\lambda_1/(\lambda_1 + \lambda_2) = .418/(.418 + .020) = .954$, indicating that the mean vectors for
factor B are essentially collinear, in which case Roy's test is more powerful. If the
mean vectors $\bar{\mathbf{y}}_{.1.}$, $\bar{\mathbf{y}}_{.2.}$, $\bar{\mathbf{y}}_{.3.}$, and $\bar{\mathbf{y}}_{.4.}$ for the four levels of B were a little further apart,
we would have a situation in which the four MANOVA tests do not lead to the same
conclusion.

For AB, the eigenvalues of $\mathbf{E}^{-1}\mathbf{H}_{AB}$ are .0651 and .0075, from which

$$V^{(s)} = \sum_{i=1}^{s} \frac{\lambda_i}{1 + \lambda_i} = .0685, \qquad U^{(s)} = .0726, \qquad \frac{\nu_E}{\nu_H} U^{(s)} = .580,$$

$$\theta = \frac{\lambda_1}{1 + \lambda_1} = .0611.$$

The critical values remain the same as for factor B, and all three tests accept $H_0$, as
does Wilks' $\Lambda$. With a nonsignificant interaction, interpretation of the main effects
is simplified. □
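
The four statistics for factor B can be recomputed from the eigenvalues of $\mathbf{E}^{-1}\mathbf{H}_B$. This is a minimal sketch using the rounded E and $\mathbf{H}_B$ matrices reported in the example; the function name `manova_statistics` and the dictionary keys are hypothetical.

```python
import numpy as np

def manova_statistics(E, H, nu_E, nu_H):
    """Wilks, Pillai, Lawley-Hotelling, and Roy statistics from the eigenvalues of E^{-1}H."""
    p = E.shape[0]
    s = min(p, nu_H)
    eig = np.linalg.eigvals(np.linalg.solve(E, H)).real
    lam = np.sort(eig)[::-1][:s]                  # the s largest eigenvalues
    return {
        "eigenvalues": lam,
        "wilks": float(np.prod(1.0 / (1.0 + lam))),
        "pillai": float(np.sum(lam / (1.0 + lam))),
        "lawley_hotelling": float(np.sum(lam)),
        "lh_scaled": float(nu_E / nu_H * np.sum(lam)),
        "roy": float(lam[0] / (1.0 + lam[0])),
    }

# E and H_B as reported (rounded) in Example 6.5.2.
E = np.array([[4.897, -1.890], [-1.890, 736.390]])
H_B = np.array([[1.694, -9.862], [-9.862, 74.874]])

stats_B = manova_statistics(E, H_B, nu_E=24, nu_H=3)
for name, value in stats_B.items():
    print(name, np.round(value, 4))
```

The results should be close to the values .6916, .314, .438, 3.502, and .295 quoted above; small differences in the last digit reflect rounding of the input matrices.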


6.6  OTHER  MODELS 

6.6.1  Higher  Order  Fixed  Effects 

A  higher  order  (balanced)  fixed-effects  model  or  factorial  experiment  presents  no 
new  difficulties.  As  an  illustration,  consider  a  three-way  classification  with  three 
factors  A,  B,  and  C  and  all  interactions  AB,  AC,  BC,  and  ABC.  The  observation 
vector  y  has  p  variables  as  usual.  The  MANOVA  model  allowing  for  main  effects 
and  interactions  can  be  written  as 

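$$\mathbf{y}_{ijkl} = \boldsymbol{\mu} + \boldsymbol{\alpha}_i + \boldsymbol{\beta}_j + \boldsymbol{\gamma}_k + \boldsymbol{\delta}_{ij} + \boldsymbol{\eta}_{ik} + \boldsymbol{\tau}_{jk} + \boldsymbol{\phi}_{ijk} + \boldsymbol{\varepsilon}_{ijkl}, \qquad (6.76)$$

where, for example, $\boldsymbol{\alpha}_i$ is the effect of the ith level of factor A on each of the p
variables in $\mathbf{y}_{ijkl}$ and $\boldsymbol{\delta}_{ij}$ is the AB interaction effect on each of the p variables.
Similarly, $\boldsymbol{\eta}_{ik}$, $\boldsymbol{\tau}_{jk}$, and $\boldsymbol{\phi}_{ijk}$ represent the AC, BC, and ABC interactions on each
of the p variables.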

The matrices of sums of squares and products for main effects, interactions, and
error are defined in a manner similar to that for the matrices detailed for the two-way
model in Section 6.5.2. The sum of squares (on the diagonal) for each variable is
calculated exactly the same as in a univariate ANOVA for a three-way model. The sums
of products (off-diagonal) are obtained analogously. Test construction parallels that
for the two-way model, using the matrix for error to test all factors and interactions.

Degrees of freedom for each factor are the same as in the corresponding three-way
univariate model. All four MANOVA test statistics can be computed for each
test. Contrasts can be defined and tested in a manner similar to that in Section 6.5.2.
Follow-up procedures on the individual variables (F-tests and discriminant
functions) can be used as discussed for the one-way or two-way models in Sections 6.4
and 6.5.2.

6.6.2  Mixed  Models 

There is a MANOVA counterpart for every univariate ANOVA design. This applies
to fixed, random, and mixed models and to experimental structures that are crossed,
nested, or a combination. Roebruck (1982) has provided a formal proof that univariate
mixed models can be generalized to multivariate mixed models. Schott and Saw
(1984) have shown that for the one-way multivariate random effects model, the
likelihood ratio approach leads to the same test statistics involving the eigenvalues of
$\mathbf{E}^{-1}\mathbf{H}$ as in the fixed-effects model.

In the (balanced) MANOVA mixed model, the expected mean square matrices
have exactly the same pattern as expected mean squares for the corresponding
univariate ANOVA model. Thus a table of expected mean squares for the terms in the
corresponding univariate model provides direction for choosing the appropriate error
matrix to test each term in the MANOVA model. However, if the matrix indicated
for "error" has fewer degrees of freedom than p, it will not have an inverse and the
test cannot be made.

To illustrate, suppose we have a (balanced) two-way MANOVA model with A
fixed and B random. Then the (univariate) expected mean squares (EMS) and Wilks'
$\Lambda$-tests are as follows:

Source   EMS                                                  $\Lambda$
A        $\sigma^2 + n\sigma^2_{AB} + nb\sigma^{*2}_A$        $|\mathbf{H}_{AB}| / |\mathbf{H}_{AB} + \mathbf{H}_A|$
B        $\sigma^2 + na\sigma^2_B$                            $|\mathbf{E}| / |\mathbf{E} + \mathbf{H}_B|$
AB       $\sigma^2 + n\sigma^2_{AB}$                          $|\mathbf{E}| / |\mathbf{E} + \mathbf{H}_{AB}|$
Error    $\sigma^2$

In the expected mean square for factor A, we have used the notation $\sigma^{*2}_A$ in place
of $\sum_i \alpha_i^2/(a - 1)$. The test for A using $\mathbf{H}_{AB}$ for error matrix will be indeterminate
(of the form 0/0) if $\nu_{AB} < p$, where $\nu_{AB} = (a - 1)(b - 1)$. In this case, $\nu_{AB}$
will often fail to exceed p. For example, suppose A has two levels and B has three.
Then $\nu_{AB} = 2$, which will ordinarily be less than p. In such a case, we would
have little recourse except to compute univariate tests on the p individual variables.
However, we would not have the multivariate test to protect against carrying out too
many univariate tests and thereby inflating the experimentwise $\alpha$ (see Section 6.4).
To protect against inflation of $\alpha$ when making p tests, we could use a Bonferroni
correction, as in procedure 2 in Section 5.5. In the case of F-tests, we do not have
a table of Bonferroni critical values, as we do for t-tests (Table A.8), but we can
achieve an equivalent result by comparing the p-values for the F-tests against $\alpha/p$
instead of against $\alpha$.
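
The Bonferroni comparison described above amounts to a one-line rule. The sketch below is illustrative only: the function name `bonferroni_reject` and the p-values are hypothetical and do not come from any example in the text.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Compare each univariate F-test p-value with alpha / p, where p is the number of tests."""
    p = len(p_values)
    threshold = alpha / p
    return threshold, [pv < threshold for pv in p_values]

# Hypothetical p-values from p = 4 univariate F-tests.
threshold, decisions = bonferroni_reject([0.004, 0.020, 0.300, 0.011], alpha=0.05)
print(threshold, decisions)
```

Here the second test would be declared significant at the unadjusted level .05 but not at the adjusted level .05/4 = .0125.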

As another illustration, consider the analysis for a (balanced) multivariate split-plot
design. For simplicity, we show the associated univariate model in place of the
multivariate model. We use the factor names A, B, AC, ... to indicate parameters in
the model:

$$y_{ijkl} = \mu + A_i + B_{(i)j} + C_k + AC_{ik} + BC_{(i)jk} + \varepsilon_{(ijk)l},$$

where A and C are fixed and B is random. Nesting is indicated by bracketed
subscripts; for example, B and BC are nested in A. Table 6.7 shows the expected mean
squares and corresponding Wilks tests.


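Table 6.7. Wilks' $\Lambda$-Tests for a Typical Split-Plot Design

Source   df                    Expected Mean Squares                                  Wilks' $\Lambda$
A        a - 1                 $\sigma^2 + ce\sigma^2_B + bce\sigma^{*2}_A$           $|\mathbf{H}_B| / |\mathbf{H}_A + \mathbf{H}_B|$
B        a(b - 1)              $\sigma^2 + ce\sigma^2_B$                              $|\mathbf{E}| / |\mathbf{H}_B + \mathbf{E}|$
C        c - 1                 $\sigma^2 + e\sigma^2_{BC} + abe\sigma^{*2}_C$         $|\mathbf{H}_{BC}| / |\mathbf{H}_C + \mathbf{H}_{BC}|$
AC       (a - 1)(c - 1)        $\sigma^2 + e\sigma^2_{BC} + be\sigma^{*2}_{AC}$       $|\mathbf{H}_{BC}| / |\mathbf{H}_{AC} + \mathbf{H}_{BC}|$
BC       a(b - 1)(c - 1)       $\sigma^2 + e\sigma^2_{BC}$                            $|\mathbf{E}| / |\mathbf{H}_{BC} + \mathbf{E}|$
Error    abc(e - 1)            $\sigma^2$

Since we use $\mathbf{H}_B$ and $\mathbf{H}_{BC}$, as well as E, to make tests, the following must hold:

$$a(b - 1) \geq p, \qquad a(b - 1)(c - 1) \geq p, \qquad abc(e - 1) \geq p.$$

To construct the other three MANOVA tests, we use eigenvalues of the following
matrices:

Source   Matrix
A        $\mathbf{H}_B^{-1}\mathbf{H}_A$
B        $\mathbf{E}^{-1}\mathbf{H}_B$
C        $\mathbf{H}_{BC}^{-1}\mathbf{H}_C$
AC       $\mathbf{H}_{BC}^{-1}\mathbf{H}_{AC}$
BC       $\mathbf{E}^{-1}\mathbf{H}_{BC}$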

With a table of expected mean squares, such as those in Table 6.7, it is a simple
matter to determine the error matrix in each case. For a given factor or interaction,
such as A, B, or AC, the appropriate error matrix is ordinarily the one whose
expected mean square matches that of the given factor except for the last term. For
example, factor C, with expected mean square $\sigma^2 + e\sigma^2_{BC} + abe\sigma^{*2}_C$, is tested by
BC, whose expected mean square is $\sigma^2 + e\sigma^2_{BC}$. If $H_0: \sigma^{*2}_C = 0$ is true, then C and
BC have the same expected mean square.

In some mixed and random models, certain terms have no available error term.
When this happens in the univariate case, we can construct an approximate test
using Satterthwaite's (1941) or another synthetic mean square approach. For a similar
approach in the multivariate case, see Khuri, Mathew, and Nel (1994).

6.7  CHECKING  ON  THE  ASSUMPTIONS 

In  Section  6.2  we  discussed  the  robustness  of  the  four  MANOVA  test  statistics  to 
nonnormality  and  heterogeneity  of  covariance  matrices.  The  MANOVA  tests  (except 
for  Roy’s)  are  rather  robust  to  these  departures  from  the  assumptions,  although,  in 
general,  as  dimensionality  increases,  robustness  decreases. 

Even though MANOVA procedures are fairly robust to departures from multivariate
normality, we may want to check for gross violations of this assumption. Any of
the tests or plots from Section 4.4 could be used. For a two-way design, for example,
the tests could be applied separately to the n observations in each individual cell (if n
is sufficiently large) or to all the residuals. The residual vectors after fitting the model
$\mathbf{y}_{ijk} = \boldsymbol{\mu}_{ij} + \boldsymbol{\varepsilon}_{ijk}$ would be

$$\hat{\boldsymbol{\varepsilon}}_{ijk} = \mathbf{y}_{ijk} - \bar{\mathbf{y}}_{ij.}, \quad i = 1, 2, \ldots, a, \quad j = 1, 2, \ldots, b, \quad k = 1, 2, \ldots, n.$$

It is also advisable to check for outliers, which can lead to either a Type I or Type II
error. The tests of Section 4.5 can be run separately for each cell (for sufficiently large
n) or for all of the abn residuals, $\hat{\boldsymbol{\varepsilon}}_{ijk} = \mathbf{y}_{ijk} - \bar{\mathbf{y}}_{ij.}$.

A test of the equality of covariance matrices can be made using Box's M-test
given in Section 7.3.2. Note the cautions expressed there about the sensitivity of this
test to nonnormality and unequal sample sizes.

The assumption of independence of the observation vectors $\mathbf{y}_{ijk}$ is even more
important than the assumptions of normality and equality of covariance matrices.
We are referring, of course, to independence from one observation vector to another.
The variables within a vector are assumed to be correlated, as usual. In the univariate
case, Barcikowski (1981) showed that a moderate amount of dependence among the
observations produces an actual $\alpha$ much greater than the nominal $\alpha$. This effect is
to be expected, since the dependence leads to an underestimate of the variance, so
that MSE is reduced and the F-statistic is inflated. We can assume that this effect on
error rates carries over to MANOVA.

In univariate ANOVA, a simple measure of dependence among the kn observations
in a one-way model is the intraclass correlation:

$$r_c = \frac{\text{MSB} - \text{MSE}}{\text{MSB} + (n - 1)\text{MSE}}, \qquad (6.77)$$

where MSB and MSE are the between and within mean squares for the variable and
n is the number of observations per group. This could be calculated for each variable
in a MANOVA to check for independence.
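
A direct transcription of (6.77) for one variable in a balanced one-way layout might look as follows. This is a sketch only: the function name `intraclass_correlation` and the small data set are hypothetical.

```python
def intraclass_correlation(groups):
    """Intraclass correlation (6.77) for k groups of n observations each on one variable."""
    k = len(groups)
    n = len(groups[0])
    if any(len(g) != n for g in groups):
        raise ValueError("balanced data required")
    means = [sum(g) / n for g in groups]
    grand = sum(means) / k
    # Between and within mean squares of the one-way ANOVA.
    msb = n * sum((m - grand) ** 2 for m in means) / (k - 1)
    mse = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g) / (k * (n - 1))
    return (msb - mse) / (msb + (n - 1) * mse)

# Hypothetical data: k = 2 groups, n = 3 observations per group.
r_c = intraclass_correlation([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
print(r_c)
```

For these data MSB = 13.5 and MSE = 1, so that $r_c = 12.5/15.5$. In a MANOVA the function would be applied to each of the p variables in turn.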

In many experimental settings, we do not anticipate a lack of independence. But
in certain cases the observations are dependent. For example, if the sampling units
are people, they may influence each other as they interact together. In some
educational studies, researchers must use entire classrooms as sampling units rather than
use individual students. Another example of dependence is furnished by observations
that are serially correlated, as in a time series. Each observation
depends to a certain extent on the preceding one, and its random movement is
somewhat dampened as a result.


6.8  PROFILE  ANALYSIS 

The  two-sample  profile  analysis  of  Section  5.9.2  can  be  extended  to  k  groups.  Again 
we  assume  that  the  variables  are  commensurate,  as,  for  example,  when  each  subject 
is  given  a  battery  of  tests.  Other  assumptions,  cautions,  and  comments  expressed  in 
Section  5.9.2  apply  here  as  well. 

The basic model is the balanced one-way MANOVA:

$$\mathbf{y}_{ij} = \boldsymbol{\mu}_i + \boldsymbol{\varepsilon}_{ij}, \quad i = 1, 2, \ldots, k, \quad j = 1, 2, \ldots, n.$$

To test $H_0: \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2 = \cdots = \boldsymbol{\mu}_k$, we use the usual H and E matrices given in
(6.9) and (6.10). If the variables are commensurate, we can be more specific and
extend $H_0$ to an examination of the k profiles obtained by plotting the p values
$\mu_{i1}, \mu_{i2}, \ldots, \mu_{ip}$ in each $\boldsymbol{\mu}_i$, as was done with two $\boldsymbol{\mu}_i$'s in Section 5.9.2 (see, for
example, Figure 5.8). We are interested in the same three hypotheses as before:

$H_{01}$: The k profiles are parallel.

$H_{02}$: The k profiles are all at the same level.

$H_{03}$: The k profiles are flat.

The hypothesis of parallelism for two groups was expressed in Section 5.9.2 as
$H_{01}: \mathbf{C}\boldsymbol{\mu}_1 = \mathbf{C}\boldsymbol{\mu}_2$, where C is any (p - 1) x p matrix of rank p - 1 such that $\mathbf{C}\mathbf{j} = \mathbf{0}$,
for example,

$$\mathbf{C} = \begin{pmatrix} 1 & -1 & 0 & \cdots & 0 \\ 0 & 1 & -1 & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & 0 & \cdots & -1 \end{pmatrix}.$$

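For k groups, the analogous hypothesis of parallelism is

$$H_{01}: \mathbf{C}\boldsymbol{\mu}_1 = \mathbf{C}\boldsymbol{\mu}_2 = \cdots = \mathbf{C}\boldsymbol{\mu}_k. \qquad (6.78)$$

The hypothesis (6.78) is equivalent to the hypothesis $H_0: \boldsymbol{\mu}_{z1} = \boldsymbol{\mu}_{z2} = \cdots = \boldsymbol{\mu}_{zk}$
in a one-way MANOVA on the transformed variables $\mathbf{z}_{ij} = \mathbf{C}\mathbf{y}_{ij}$. Since C has p - 1
rows, $\mathbf{C}\mathbf{y}_{ij}$ is (p - 1) x 1, $\mathbf{C}\boldsymbol{\mu}_i$ is (p - 1) x 1, and $\mathbf{C}\boldsymbol{\Sigma}\mathbf{C}'$ is (p - 1) x (p - 1). By
property 1b in Section 4.2, $\mathbf{z}_{ij}$ is distributed as $N_{p-1}(\mathbf{C}\boldsymbol{\mu}_i, \mathbf{C}\boldsymbol{\Sigma}\mathbf{C}')$.

By analogy with (3.64), the hypothesis and error matrices for testing $H_{01}$ in (6.78)
are

$$\mathbf{H}_z = \mathbf{C}\mathbf{H}\mathbf{C}', \qquad \mathbf{E}_z = \mathbf{C}\mathbf{E}\mathbf{C}'.$$

We thus have

$$\Lambda = \frac{|\mathbf{C}\mathbf{E}\mathbf{C}'|}{|\mathbf{C}\mathbf{E}\mathbf{C}' + \mathbf{C}\mathbf{H}\mathbf{C}'|} = \frac{|\mathbf{C}\mathbf{E}\mathbf{C}'|}{|\mathbf{C}(\mathbf{E} + \mathbf{H})\mathbf{C}'|}, \qquad (6.79)$$

which is distributed as $\Lambda_{p-1,\nu_H,\nu_E}$, where $\nu_H = k - 1$ and $\nu_E = k(n - 1)$.
The other three MANOVA test statistics can be obtained from the eigenvalues of
$(\mathbf{C}\mathbf{E}\mathbf{C}')^{-1}(\mathbf{C}\mathbf{H}\mathbf{C}')$. The test for $H_{01}$ can easily be adjusted for unbalanced data,
as in Section 6.1.6. We would calculate H and E by (6.32) and (6.33) and use
$\nu_E = \sum_i n_i - k$.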

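The hypothesis that two profiles are at the same level is $H_{02}: \mathbf{j}'\boldsymbol{\mu}_1 = \mathbf{j}'\boldsymbol{\mu}_2$ (see
Section 5.9.2), which generalizes immediately to k profiles at the same level,

$$H_{02}: \mathbf{j}'\boldsymbol{\mu}_1 = \mathbf{j}'\boldsymbol{\mu}_2 = \cdots = \mathbf{j}'\boldsymbol{\mu}_k. \qquad (6.80)$$

For two groups we used a univariate t, as defined in (5.36), to test $H_{02}$. Similarly, for
k groups we can employ an F-test for one-way ANOVA comparing k groups with
observations $\mathbf{j}'\mathbf{y}_{ij}$. Alternatively, we can utilize (6.79) with $\mathbf{C} = \mathbf{j}'$,

$$\Lambda = \frac{\mathbf{j}'\mathbf{E}\mathbf{j}}{\mathbf{j}'\mathbf{E}\mathbf{j} + \mathbf{j}'\mathbf{H}\mathbf{j}}, \qquad (6.81)$$

which is distributed as $\Lambda_{1,\nu_H,\nu_E}$ (p = 1 because $\mathbf{j}'\mathbf{y}_{ij}$ is a scalar). This is, of course,
equivalent to the F-test on $\mathbf{j}'\mathbf{y}_{ij}$, since by Table 6.1 in Section 6.1.3,

$$F = \frac{1 - \Lambda}{\Lambda} \, \frac{\nu_E}{\nu_H} \qquad (6.82)$$

is distributed as $F_{\nu_H,\nu_E}$.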

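The third hypothesis, that of "flatness," essentially states that the average of the k
group means is the same for each variable [see (5.37)]:

$$H_{03}: \frac{\mu_{11} + \mu_{21} + \cdots + \mu_{k1}}{k} = \frac{\mu_{12} + \mu_{22} + \cdots + \mu_{k2}}{k} = \cdots = \frac{\mu_{1p} + \mu_{2p} + \cdots + \mu_{kp}}{k},$$

or by analogy with (5.38),

$$H_{03}: \frac{\mathbf{C}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2 + \cdots + \boldsymbol{\mu}_k)}{k} = \mathbf{0}, \qquad (6.83)$$

where C is a (p - 1) x p matrix of rank p - 1 such that $\mathbf{C}\mathbf{j} = \mathbf{0}$ [see (6.78)]. The
flatness hypothesis can also be stated as follows: the means of all p variables in each group
are the same, or $\mu_{i1} = \mu_{i2} = \cdots = \mu_{ip}$, i = 1, 2, ..., k. This can be expressed as

$$H_{03}: \mathbf{C}\boldsymbol{\mu}_1 = \mathbf{C}\boldsymbol{\mu}_2 = \cdots = \mathbf{C}\boldsymbol{\mu}_k = \mathbf{0}.$$

To test $H_{03}$ as given by (6.83), we can extend the $T^2$-test in (5.39). The grand
mean vector $(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2 + \cdots + \boldsymbol{\mu}_k)/k$ in (6.83) can be estimated as in Section 6.1.2 by

$$\bar{\mathbf{y}}_{..} = \frac{\sum_{ij} \mathbf{y}_{ij}}{kn}.$$

Under $H_{03}$ (and $H_{01}$), $\mathbf{C}\bar{\mathbf{y}}_{..}$ is $N_{p-1}(\mathbf{0}, \mathbf{C}\boldsymbol{\Sigma}\mathbf{C}'/kn)$, and $H_{03}$ can be tested by

$$T^2 = kn(\mathbf{C}\bar{\mathbf{y}}_{..})'(\mathbf{C}\mathbf{E}\mathbf{C}'/\nu_E)^{-1}\mathbf{C}\bar{\mathbf{y}}_{..}, \qquad (6.84)$$

where $\mathbf{E}/\nu_E$ is an estimate of $\boldsymbol{\Sigma}$. As in the two-sample case, $H_{03}$ is unaffected by
the status of $H_{02}$. When both $H_{01}$ and $H_{03}$ are true, $T^2$ in (6.84) is distributed as
$T^2_{p-1,\nu_E}$.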

Example 6.8. Three vitamin E diet supplements with levels zero, low, and high were
compared for their effect on growth of guinea pigs (Crowder and Hand 1990, pp. 21-29).
Five guinea pigs received each supplement level and their weights were recorded
at the end of weeks 1, 3, 4, 5, 6, and 7. These weights are given in Table 6.8.

Table 6.8. Weights of Guinea Pigs under Three Levels of Vitamin E Supplements

Group   Animal   Week 1   Week 3   Week 4   Week 5   Week 6   Week 7
1        1        455      460      510      504      436      466
1        2        467      565      610      596      542      587
1        3        445      530      580      597      582      619
1        4        485      542      594      583      611      612
1        5        480      500      550      528      562      576
2        6        514      560      565      524      552      597
2        7        440      480      536      484      567      569
2        8        495      570      569      585      576      677
2        9        520      590      610      637      671      702
2       10        503      555      591      605      649      675
3       11        496      560      622      622      632      670
3       12        498      540      589      557      568      609
3       13        478      510      568      555      576      605
3       14        545      565      580      601      633      649
3       15        472      498      540      524      532      583

The three mean vectors are

$$\bar{\mathbf{y}}_{1.}' = (466.4,\ 519.4,\ 568.8,\ 561.6,\ 546.6,\ 572.0),$$
$$\bar{\mathbf{y}}_{2.}' = (494.4,\ 551.0,\ 574.2,\ 567.0,\ 603.0,\ 644.0),$$
$$\bar{\mathbf{y}}_{3.}' = (497.8,\ 534.6,\ 579.8,\ 571.8,\ 588.2,\ 623.2),$$

and the overall mean vector is

$$\bar{\mathbf{y}}_{..}' = (486.2,\ 535.0,\ 574.3,\ 566.8,\ 579.3,\ 613.1).$$


A profile plot of the means $\bar{\mathbf{y}}_{1.}$, $\bar{\mathbf{y}}_{2.}$, and $\bar{\mathbf{y}}_{3.}$ is given in Figure 6.3. There is a high
degree of parallelism in the three profiles, with the possible exception of week 6 for
group 1.

[Figure 6.3. Profile of the three groups for the guinea pig data of Table 6.8. Horizontal axis: Week.]

The E and H matrices are as follows:

$$\mathbf{E} = \begin{pmatrix}
8481.2 & 8538.8 & 4819.8 & 8513.6 & 8710.0 & 8468.2 \\
8538.8 & 17170.4 & 13293.0 & 19476.4 & 17034.2 & 20035.4 \\
4819.8 & 13293.0 & 12992.4 & 17077.4 & 17287.8 & 17697.2 \\
8513.6 & 19476.4 & 17077.4 & 28906.0 & 26226.4 & 28625.2 \\
8710.0 & 17034.2 & 17287.8 & 26226.4 & 36898.0 & 31505.8 \\
8468.2 & 20035.4 & 17697.2 & 28625.2 & 31505.8 & 33538.8
\end{pmatrix},$$

$$\mathbf{H} = \begin{pmatrix}
2969.2 & 2177.2 & 859.4 & 813.0 & 4725.2 & 5921.6 \\
2177.2 & 2497.6 & 410.0 & 411.6 & 4428.8 & 5657.6 \\
859.4 & 410.0 & 302.5 & 280.4 & 1132.1 & 1392.5 \\
813.0 & 411.6 & 280.4 & 260.4 & 1096.4 & 1352.0 \\
4725.2 & 4428.8 & 1132.1 & 1096.4 & 8550.9 & 10830.9 \\
5921.6 & 5657.6 & 1392.5 & 1352.0 & 10830.9 & 13730.1
\end{pmatrix}.$$

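Using

$$\mathbf{C} = \begin{pmatrix}
1 & -1 & 0 & 0 & 0 & 0 \\
0 & 1 & -1 & 0 & 0 & 0 \\
0 & 0 & 1 & -1 & 0 & 0 \\
0 & 0 & 0 & 1 & -1 & 0 \\
0 & 0 & 0 & 0 & 1 & -1
\end{pmatrix}$$

in the test statistic (6.79), we have, as a test for parallelism,

$$\Lambda = \frac{|\mathbf{C}\mathbf{E}\mathbf{C}'|}{|\mathbf{C}(\mathbf{E} + \mathbf{H})\mathbf{C}'|} = \frac{3.8238 \times 10^{18}}{2.1355 \times 10^{19}} = .1791 > \Lambda_{.05,5,2,12} = .153.$$

Thus we do not reject the parallelism hypothesis.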

To test the hypothesis that the three profiles are at the same level, we use (6.81),

$$\Lambda = \frac{\mathbf{j}'\mathbf{E}\mathbf{j}}{\mathbf{j}'\mathbf{E}\mathbf{j} + \mathbf{j}'\mathbf{H}\mathbf{j}} = \frac{632{,}605.2}{632{,}605.2 + 111{,}288.1} = .8504 > \Lambda_{.05,1,2,12} = .607.$$

Hence we do not reject the levels hypothesis. This can also be seen by using (6.82)
to transform $\Lambda$ to F,

$$F = \frac{(1 - \Lambda)\nu_E}{\Lambda \nu_H} = \frac{(1 - .8504)12}{(.8504)2} = 1.0555,$$

which is clearly nonsignificant (p = .378).

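To test the “flatness” hypothesis, we use (6.84):

\[
\begin{aligned}
T^2 &= kn(\mathbf{C}\bar{\mathbf{y}}_{..})'(\mathbf{C}\mathbf{E}\mathbf{C}'/\nu_E)^{-1}\mathbf{C}\bar{\mathbf{y}}_{..} \\
&= 15
\begin{pmatrix} -48.80 \\ -39.27 \\ 7.47 \\ -12.47 \\ -33.80 \end{pmatrix}'
\begin{pmatrix}
714.5 & -13.2 & 207.5 & -219.9 & 270.2 \\
-13.2 & 298.1 & -174.9 & 221.0 & -216.0 \\
207.5 & -174.9 & 645.3 & -240.8 & 165.8 \\
-219.9 & 221.0 & -240.8 & 1112.6 & -649.2 \\
270.2 & -216.0 & 165.8 & -649.2 & 618.8
\end{pmatrix}^{-1}
\begin{pmatrix} -48.80 \\ -39.27 \\ 7.47 \\ -12.47 \\ -33.80 \end{pmatrix} \\
&= 297.13 > T^2_{.01,5,12} = 49.739.
\end{aligned}
\]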


Thus  only  the  flatness  hypothesis  is  rejected  in  this  case. 


□ 
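The three profile-analysis tests in this example can be reproduced with a few lines of matrix code. The following is a minimal sketch, assuming NumPy and using the E and H matrices and the vector of contrasts in the overall mean exactly as printed above (so results agree with the text only up to rounding of the printed values). The variable names are illustrative and not part of the original example.

```python
import numpy as np

# E and H for the guinea pig data (p = 6 weeks), copied from the text.
E = np.array([
    [8481.2, 8538.8, 4819.8, 8513.6, 8710.0, 8468.2],
    [8538.8, 17170.4, 13293.0, 19476.4, 17034.2, 20035.4],
    [4819.8, 13293.0, 12992.4, 17077.4, 17287.8, 17697.2],
    [8513.6, 19476.4, 17077.4, 28906.0, 26226.4, 28625.2],
    [8710.0, 17034.2, 17287.8, 26226.4, 36898.0, 31505.8],
    [8468.2, 20035.4, 17697.2, 28625.2, 31505.8, 33538.8],
])
H = np.array([
    [2969.2, 2177.2, 859.4, 813.0, 4725.2, 5921.6],
    [2177.2, 2497.6, 410.0, 411.6, 4428.8, 5657.6],
    [859.4, 410.0, 302.5, 280.4, 1132.1, 1392.5],
    [813.0, 411.6, 280.4, 260.4, 1096.4, 1352.0],
    [4725.2, 4428.8, 1132.1, 1096.4, 8550.9, 10830.9],
    [5921.6, 5657.6, 1392.5, 1352.0, 10830.9, 13730.1],
])
k, n, p = 3, 5, 6                    # groups, subjects per group, weeks
nu_E, nu_H = k * (n - 1), k - 1      # 12 and 2

# Successive-difference contrasts: row i is e_i - e_{i+1}.
C = np.eye(p - 1, p) - np.eye(p - 1, p, k=1)

# Parallelism: Wilks' lambda on the contrasts.
lam_parallel = np.linalg.det(C @ E @ C.T) / np.linalg.det(C @ (E + H) @ C.T)

# Levels: univariate lambda on the sums j'y, then the transformation to F.
j = np.ones(p)
lam_levels = (j @ E @ j) / (j @ E @ j + j @ H @ j)
F_levels = (1 - lam_levels) * nu_E / (lam_levels * nu_H)

# Flatness: T^2 using the contrasts in the overall mean printed in the text.
Cy = np.array([-48.80, -39.27, 7.47, -12.47, -33.80])
T2_flat = k * n * Cy @ np.linalg.solve(C @ E @ C.T / nu_E, Cy)

print(lam_parallel, lam_levels, F_levels, T2_flat)
```

Each statistic is then compared with its critical value as in the text: .153 for parallelism, .607 for levels, and 49.739 for flatness.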


6.9  REPEATED  MEASURES  DESIGNS 
6.9.1  Multivariate  vs.  Univariate  Approach 

In repeated measures designs, the research unit is typically a human or animal subject. Each subject is measured under several treatments or at different points of time. The treatments might be tests, drug levels, various kinds of stimuli, and so on. If the treatments are such that the order of presentation to the various subjects can be varied, then the order should be randomized to avoid an ordering bias. If subjects are measured at successive time points, it may be of interest to determine the degree of polynomial required to fit the curve. This is treated in Section 6.10 as part of an analysis of growth curves.

When comparing means of the treatments applied to each subject, we are examining the within-subjects factor. There will also be a between-subjects factor if there are several groups of subjects that we wish to compare. In Sections 6.9.2-6.9.6, we consider designs up to a complexity level of two within-subjects factors and two between-subjects factors.

We  now  discuss  univariate  and  multivariate  approaches  to  hypothesis  testing  in 
repeated  measures  designs.  As  a  framework  for  this  discussion,  consider  the  layout 
in  Table  6.9  for  a  repeated  measures  design  with  one  repeated  measures  (within- 
subjects)  factor,  A,  and  one  grouping  (between-subjects)  factor,  B. 

This  design  has  often  been  analyzed  as  a  univariate  mixed-model  nested  design, 
also  called  a  split-plot  design,  with  subjects  nested  in  factor  B  (whole-plot),  which 
is  crossed  with  factor  A  (repeated  measures,  or  split-plot).  The  univariate  model  for 
each $y_{ijr}$ would be

\[
y_{ijr} = \mu + B_i + S_{(i)j} + A_r + BA_{ir} + \varepsilon_{ijr}, \tag{6.85}
\]

where the factor level designations ($B$, $S$, $A$, and $BA$) from (6.85) and Table 6.9 are used as parameter values and the subscript $(i)j$ on $S$ indicates that subjects are nested in factor B. In Table 6.9, the observations $y_{ijr}$ for $r = 1, 2, \ldots, p$ are enclosed in parentheses and denoted by $\mathbf{y}'_{ij}$ to emphasize that these $p$ variables are measured on one subject and thus constitute a vector of correlated variables. The ranges of the subscripts can be seen in Table 6.9: $i = 1, 2, \ldots, k$; $j = 1, 2, \ldots, n$; and $r = 1, 2, \ldots, p$. With factors A and B fixed and subjects random, the univariate ANOVA is given in Table 6.10.

Table 6.9. Data Layout for k-Groups Repeated Measures Experiment

Factor B                 Factor A (Repeated Measures)
(Group)    Subjects      A_1       A_2      ...   A_p
B_1        S_11         (y_111     y_112    ...   y_11p)  =  y'_11
           S_12         (y_121     y_122    ...   y_12p)  =  y'_12
           ...
           S_1n         (y_1n1     y_1n2    ...   y_1np)  =  y'_1n
B_2        S_21         (y_211     y_212    ...   y_21p)  =  y'_21
           S_22         (y_221     y_222    ...   y_22p)  =  y'_22
           ...
           S_2n         (y_2n1     y_2n2    ...   y_2np)  =  y'_2n
...
B_k        S_k1         (y_k11     y_k12    ...   y_k1p)  =  y'_k1
           S_k2         (y_k21     y_k22    ...   y_k2p)  =  y'_k2
           ...
           S_kn         (y_kn1     y_kn2    ...   y_knp)  =  y'_kn

However,  our  initial  reaction  would  be  to  rule  out  the  univariate  ANOVA  because 
the  y’s  in  each  row  are  correlated  and  the  assumption  of  independence  is  critical, 
as  noted  in  Section  6.7.  We  will  discuss  below  some  assumptions  under  which  the 
univariate  analysis  would  be  appropriate. 

In the multivariate approach, the $p$ responses $y_{ij1}, y_{ij2}, \ldots, y_{ijp}$ (repeated measures) for subject $S_{ij}$ constitute a vector $\mathbf{y}_{ij}$, as shown in Table 6.9. The multivariate model for $\mathbf{y}_{ij}$ is a simple one-way MANOVA model,

\[
\mathbf{y}_{ij} = \boldsymbol{\mu} + \boldsymbol{\beta}_i + \boldsymbol{\varepsilon}_{ij}, \tag{6.86}
\]

where $\boldsymbol{\beta}_i$ is a vector of $p$ main effects (corresponding to the $p$ variables in $\mathbf{y}_{ij}$) for factor B, and $\boldsymbol{\varepsilon}_{ij}$ is an error vector for subject $S_{ij}$. This model seems to include only factor B, but we show in Section 6.9.3 how to use an approach similar to profile analysis in Section 6.8 to obtain tests on factor A and the BA interaction. The MANOVA assumption that $\operatorname{cov}(\mathbf{y}_{ij}) = \boldsymbol{\Sigma}$ for all $i$ and $j$ allows the $p$ repeated measures to be correlated in any pattern, since $\boldsymbol{\Sigma}$ is completely general. On the other hand, the ANOVA assumptions of independence and homogeneity of variances can be expressed as $\operatorname{cov}(\mathbf{y}_{ij}) = \sigma^2\mathbf{I}$. We would be very surprised if repeated measurements on the same subject were independent.

Table 6.10. Univariate ANOVA for Data Layout in Table 6.9

Source                     df                MS      F
B (between)                k - 1             MSB     MSB/MSS
S (subjects)               k(n - 1)          MSS
A (within or repeated)     p - 1             MSA     MSA/MSE
BA                         (k - 1)(p - 1)    MSBA    MSBA/MSE
Error (SA interaction)     k(n - 1)(p - 1)   MSE

The univariate ANOVA approach has been found to be appropriate under less stringent conditions than $\boldsymbol{\Sigma} = \sigma^2\mathbf{I}$. Wilks (1946) showed that the ordinary F-tests of ANOVA remain valid for a covariance structure of the form

\[
\operatorname{cov}(\mathbf{y}_{ij}) = \boldsymbol{\Sigma} = \sigma^2
\begin{pmatrix}
1 & \rho & \rho & \cdots & \rho \\
\rho & 1 & \rho & \cdots & \rho \\
\vdots & \vdots & \vdots & & \vdots \\
\rho & \rho & \rho & \cdots & 1
\end{pmatrix}
= \sigma^2[(1 - \rho)\mathbf{I} + \rho\mathbf{J}], \tag{6.87}
\]

where $\mathbf{J}$ is a square matrix of 1's, as defined in (2.12) [see Rencher (2000, pp. 150-151)]. The covariance pattern (6.87) is variously known as uniformity, compound symmetry, or the intraclass correlation model. It allows for the variables to be correlated but restricts every variable to have the same variance and every pair of variables to have the same covariance. In a carefully designed experiment with appropriate randomization, this assumption may hold under the hypothesis of no A effect. Alternatively, we could use a test of the hypothesis that $\boldsymbol{\Sigma}$ has the pattern (6.87) (see Section 7.2.3). If this hypothesis is accepted, one could proceed with the usual ANOVA F-tests.

Bock (1963) and Huynh and Feldt (1970) showed that the most general condition under which univariate F-tests remain valid is that

\[
\mathbf{C}\boldsymbol{\Sigma}\mathbf{C}' = \sigma^2\mathbf{I}, \tag{6.88}
\]

where $\mathbf{C}$ is a $(p - 1) \times p$ matrix whose rows are orthonormal contrasts (orthogonal contrasts that have been normalized to unit length). We can construct $\mathbf{C}$ by choosing any $p - 1$ orthogonal contrasts among the means $\mu_1, \mu_2, \ldots, \mu_p$ of the repeated measures factor and dividing each contrast $\sum_{r=1}^{p} c_r\mu_r$ by $\sqrt{\sum_{r=1}^{p} c_r^2}$. (This matrix $\mathbf{C}$ is different from $\mathbf{C}$ used in Section 6.8 and in the remainder of Section 6.9, whose rows are contrasts that are not normalized to unit length.) It can be shown that (6.87) is a special case of (6.88). The condition (6.88) is sometimes referred to as sphericity, although this term can also refer to the covariance pattern $\boldsymbol{\Sigma} = \sigma^2\mathbf{I}$ on the untransformed $\mathbf{y}_{ij}$ (see Section 7.2.2).

A simple way to test the hypothesis that (6.88) holds is to transform the data by $\mathbf{z}_{ij} = \mathbf{C}\mathbf{y}_{ij}$ and test $H_0\colon \boldsymbol{\Sigma}_z = \sigma^2\mathbf{I}$, as in Section 7.2.2, using $\mathbf{C}\mathbf{S}_{pl}\mathbf{C}'$ in place of $\mathbf{S}_{pl} = \mathbf{E}/\nu_E$.
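The statement that compound symmetry (6.87) is a special case of sphericity (6.88) can be checked numerically. The sketch below, assuming NumPy, builds orthonormal contrasts of the Helmert type (one possible choice; any orthonormal contrast matrix would do) and applies them to a compound-symmetric matrix with illustrative values of $\sigma^2$ and $\rho$. The function name `helmert_contrasts` is hypothetical.

```python
import numpy as np

def helmert_contrasts(p):
    """Return a (p-1) x p matrix whose rows are orthonormal contrasts."""
    C = np.zeros((p - 1, p))
    for i in range(1, p):
        C[i - 1, :i] = 1.0               # first i entries equal 1
        C[i - 1, i] = -float(i)          # next entry balances them (row sums to 0)
        C[i - 1] /= np.sqrt(i * (i + 1)) # normalize the row to unit length
    return C

p, sigma2, rho = 4, 2.5, 0.6             # illustrative values only
I, J = np.eye(p), np.ones((p, p))
Sigma = sigma2 * ((1 - rho) * I + rho * J)   # compound symmetry, as in (6.87)

C = helmert_contrasts(p)
M = C @ Sigma @ C.T                      # should be a multiple of the identity
print(np.round(M, 6))
```

Because $\mathbf{C}\mathbf{J} = \mathbf{O}$ and $\mathbf{C}\mathbf{C}' = \mathbf{I}$, the product reduces to $\sigma^2(1 - \rho)\mathbf{I}$, so the constant in (6.88) is $\sigma^2(1 - \rho)$ rather than the original $\sigma^2$.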

Thus one procedure for repeated measures designs is to make a preliminary test for (6.87) or (6.88) and, if the hypothesis is accepted, use univariate F-tests, as in Table 6.10. Fehlberg (1980) investigated the use of larger $\alpha$-values with a preliminary test of structure of the covariance matrix, as in (6.88). He concludes that using $\alpha = .40$ sufficiently controls the problem of falsely accepting sphericity so as to justify the use of a preliminary test.

If the univariate test for the repeated measures factor A is appropriate, it is more powerful because it has more degrees of freedom for error than the corresponding multivariate test. However, even mild departures from (6.88) seriously inflate the Type I error rate of the univariate test for factor A (Box 1954; Davidson 1972; Boik 1981). Because such departures can be easily missed in a preliminary test, Boik (1981) concludes that “on the whole, the ordinary F tests have nothing to recommend them” (p. 248) and emphasized that “there is no justification for employing ordinary univariate F tests for repeated measures treatment contrasts” (p. 254).

Another approach to analysis of repeated measures designs is to adjust the univariate F-test for the amount of departure from sphericity. Box (1954) and Greenhouse and Geisser (1959) showed that when $\boldsymbol{\Sigma} \ne \sigma^2\mathbf{I}$, an approximate F-test for effects involving the repeated measures is obtained by reducing the degrees of freedom for both numerator and denominator by a factor of

\[
\varepsilon = \frac{[\operatorname{tr}(\boldsymbol{\Sigma} - \mathbf{J}\boldsymbol{\Sigma}/p)]^2}{(p - 1)\operatorname{tr}(\boldsymbol{\Sigma} - \mathbf{J}\boldsymbol{\Sigma}/p)^2}, \tag{6.89}
\]

where $\mathbf{J}$ is a $p \times p$ matrix of 1's defined in (2.12). For example, in Table 6.10 the F-value for the BA interaction would be compared to $F_\alpha$ with $\varepsilon(k - 1)(p - 1)$ and $\varepsilon k(n - 1)(p - 1)$ degrees of freedom. An estimate $\hat{\varepsilon}$ can be obtained by substituting $\hat{\boldsymbol{\Sigma}} = \mathbf{E}/\nu_E$ in (6.89). Greenhouse and Geisser (1959) showed that $\varepsilon$ and $\hat{\varepsilon}$ vary between $1/(p - 1)$ and 1, with $\varepsilon = 1$ when sphericity holds and $\varepsilon \ge 1/(p - 1)$ for other values of $\boldsymbol{\Sigma}$. Thus $\varepsilon$ is a measure of nonsphericity. For a conservative test, Greenhouse and Geisser recommend dividing numerator and denominator degrees of freedom by $p - 1$. Huynh and Feldt (1976) provided an improved estimator of $\varepsilon$.
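As an illustration of (6.89), the following sketch (assuming NumPy) evaluates $\varepsilon$ for two covariance matrices chosen only for demonstration: a compound-symmetric matrix, for which sphericity holds, and a first-order autoregressive pattern, for which it does not. The function name `gg_epsilon` and both matrices are assumptions made for this example; in practice one would pass $\hat{\boldsymbol{\Sigma}} = \mathbf{E}/\nu_E$.

```python
import numpy as np

def gg_epsilon(Sigma):
    """Evaluate the degrees-of-freedom factor in (6.89) for a p x p covariance matrix."""
    p = Sigma.shape[0]
    J = np.ones((p, p))
    D = Sigma - J @ Sigma / p
    # Numerator: squared trace of D; denominator: (p-1) times trace of D squared.
    return np.trace(D) ** 2 / ((p - 1) * np.trace(D @ D))

p = 4
idx = np.arange(p)

# Compound symmetry: sigma^2 = 2, rho = 0.5 (illustrative values).
cs = 2.0 * (0.5 * np.eye(p) + 0.5 * np.ones((p, p)))

# AR(1)-type pattern with correlation 0.8^|r-s| (illustrative, not spherical).
ar1 = 0.8 ** np.abs(np.subtract.outer(idx, idx))

eps_cs = gg_epsilon(cs)
eps_ar1 = gg_epsilon(ar1)
print(eps_cs, eps_ar1)
```

The first value equals 1, and the second falls between the lower bound $1/(p - 1)$ and 1. The adjusted test would multiply both degrees of freedom by the estimated factor before looking up the critical value of $F$.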

The behavior of the approximate univariate F-test with degrees of freedom adjusted by $\hat{\varepsilon}$ has been investigated by Collier et al. (1967), Huynh (1978), Davidson (1972), Rogan, Keselman, and Mendoza (1979), and Maxwell and Avery (1982). In these studies, the true $\alpha$-level turned out to be close to the nominal $\alpha$, and the power was close to that of the multivariate test. However, since the $\varepsilon$-adjusted F-test is only approximate and has no power advantage over the exact multivariate test, there appears to be no compelling reason to use it. The only case in which we need to fall back on a univariate test is when there are insufficient degrees of freedom to perform a multivariate test, that is, when $p > \nu_E$.

In Sections 6.9.2-6.9.6, we discuss the multivariate approach to repeated measures. We will cover several models, beginning with the simple one-sample design.

6.9.2 One-Sample Repeated Measures Model

We illustrate some of the procedures in this section with $p = 4$. A one-sample design with four repeated measures on $n$ subjects would appear as in Table 6.11. There is a superficial resemblance to a univariate randomized block design. However, in the repeated measures design, the observations $y_{i1}$, $y_{i2}$, $y_{i3}$, and $y_{i4}$ are correlated because they are measured on the same subject (experimental unit), whereas in a randomized block design $y_{i1}$, $y_{i2}$, $y_{i3}$, and $y_{i4}$ would be measured on four different experimental units. Thus we have a single sample of $n$ observation vectors $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n$.

Table 6.11. Data Layout for a Single-Sample Repeated Measures Design

             Factor A (Repeated Measures)
Subjects     A_1      A_2      A_3      A_4
S_1         (y_11     y_12     y_13     y_14)  =  y'_1
S_2         (y_21     y_22     y_23     y_24)  =  y'_2
...
S_n         (y_n1     y_n2     y_n3     y_n4)  =  y'_n

To test for significance of factor A, we compare the means of the four variables in $\mathbf{y}$,

\[
E(\mathbf{y}) = \boldsymbol{\mu} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \\ \mu_4 \end{pmatrix}.
\]

The hypothesis is $H_0\colon \mu_1 = \mu_2 = \mu_3 = \mu_4$, which can be reexpressed as $H_0\colon \mu_1 - \mu_2 = \mu_2 - \mu_3 = \mu_3 - \mu_4 = 0$ or $\mathbf{C}_1\boldsymbol{\mu} = \mathbf{0}$, where

\[
\mathbf{C}_1 = \begin{pmatrix} 1 & -1 & 0 & 0 \\ 0 & 1 & -1 & 0 \\ 0 & 0 & 1 & -1 \end{pmatrix}.
\]

To test $H_0\colon \mathbf{C}_1\boldsymbol{\mu} = \mathbf{0}$ for a general value of $p$ ($p$ repeated measures on $n$ subjects), we calculate $\bar{\mathbf{y}}$ and $\mathbf{S}$ from $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n$ and extend $\mathbf{C}_1$ to $p - 1$ rows. Then when $H_0$ is true, $\mathbf{C}_1\bar{\mathbf{y}}$ is $N_{p-1}(\mathbf{0}, \mathbf{C}_1\boldsymbol{\Sigma}\mathbf{C}_1'/n)$, and

\[
T^2 = n(\mathbf{C}_1\bar{\mathbf{y}})'(\mathbf{C}_1\mathbf{S}\mathbf{C}_1')^{-1}(\mathbf{C}_1\bar{\mathbf{y}}) \tag{6.90}
\]

is distributed as $T^2_{p-1,n-1}$. We reject $H_0\colon \mathbf{C}_1\boldsymbol{\mu} = \mathbf{0}$ if $T^2 > T^2_{\alpha,p-1,n-1}$. Note that the dimension is $p - 1$ because $\mathbf{C}_1\bar{\mathbf{y}}$ is $(p - 1) \times 1$ [see (5.33)].

The multivariate approach involving transformed observations $\mathbf{z}_i = \mathbf{C}_1\mathbf{y}_i$ was first suggested by Hsu (1938) and has been discussed further by Williams (1970) and Morrison (1972). Note that in $\mathbf{C}_1\bar{\mathbf{y}}$ (for $p = 4$), we work with contrasts on the elements $\bar{y}_1$, $\bar{y}_2$, $\bar{y}_3$, and $\bar{y}_4$ within the vector

\[
\bar{\mathbf{y}} = \begin{pmatrix} \bar{y}_1 \\ \bar{y}_2 \\ \bar{y}_3 \\ \bar{y}_4 \end{pmatrix},
\]

as opposed to the contrasts involving comparisons of several mean vectors themselves, as, for example, in Section 6.3.2.

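The hypothesis $H_0\colon \mu_1 = \mu_2 = \mu_3 = \mu_4$ can also be expressed as $H_0\colon \mu_1 - \mu_4 = \mu_2 - \mu_4 = \mu_3 - \mu_4 = 0$, or $\mathbf{C}_2\boldsymbol{\mu} = \mathbf{0}$, where

\[
\mathbf{C}_2 = \begin{pmatrix} 1 & 0 & 0 & -1 \\ 0 & 1 & 0 & -1 \\ 0 & 0 & 1 & -1 \end{pmatrix}.
\]

The matrix $\mathbf{C}_1$ can be obtained from $\mathbf{C}_2$ by simple row operations, for example, subtracting the second row from the first and the third row from the second. Hence, $\mathbf{C}_1 = \mathbf{A}\mathbf{C}_2$, where

\[
\mathbf{A} = \begin{pmatrix} 1 & -1 & 0 \\ 0 & 1 & -1 \\ 0 & 0 & 1 \end{pmatrix}.
\]

In fact, $H_0\colon \mu_1 = \mu_2 = \cdots = \mu_p$ can be expressed as $\mathbf{C}\boldsymbol{\mu} = \mathbf{0}$ for any full-rank $(p - 1) \times p$ matrix $\mathbf{C}$ such that $\mathbf{C}\mathbf{j} = \mathbf{0}$, and the same value of $T^2$ in (6.90) will result. The contrasts in $\mathbf{C}$ can be either linearly independent or orthogonal.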

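The hypothesis $H_0\colon \mu_1 = \mu_2 = \cdots = \mu_p = \mu$, say, can also be expressed as

\[
H_0\colon \boldsymbol{\mu} = \mu\mathbf{j},
\]

where $\mathbf{j} = (1, 1, \ldots, 1)'$. The maximum likelihood estimate of $\mu$ is

\[
\hat{\mu} = \frac{\bar{\mathbf{y}}'\mathbf{S}^{-1}\mathbf{j}}{\mathbf{j}'\mathbf{S}^{-1}\mathbf{j}}.
\]

The likelihood ratio test of $H_0$ is a function of

\[
\bar{\mathbf{y}}'\mathbf{S}^{-1}\bar{\mathbf{y}} - \frac{(\bar{\mathbf{y}}'\mathbf{S}^{-1}\mathbf{j})^2}{\mathbf{j}'\mathbf{S}^{-1}\mathbf{j}}. \tag{6.91}
\]

Williams (1970) showed that for any $(p - 1) \times p$ matrix $\mathbf{C}$ of rank $p - 1$ such that $\mathbf{C}\mathbf{j} = \mathbf{0}$,

\[
\bar{\mathbf{y}}'\mathbf{S}^{-1}\bar{\mathbf{y}} - \frac{(\bar{\mathbf{y}}'\mathbf{S}^{-1}\mathbf{j})^2}{\mathbf{j}'\mathbf{S}^{-1}\mathbf{j}} = (\mathbf{C}\bar{\mathbf{y}})'(\mathbf{C}\mathbf{S}\mathbf{C}')^{-1}(\mathbf{C}\bar{\mathbf{y}}),
\]

and thus the $T^2$-test in (6.90) is equivalent to the likelihood ratio test.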
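Both the invariance of $T^2$ to the choice of contrast matrix and the identity attributed to Williams (1970) are easy to confirm numerically. The sketch below assumes NumPy and uses simulated data (the sample size, mean shifts, and random seed are arbitrary choices made for illustration, not values from the text).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 12, 4
# Simulated sample of n observation vectors with unequal means (illustrative).
Y = rng.normal(size=(n, p)) + np.array([0.0, 0.5, 1.0, 0.2])
ybar = Y.mean(axis=0)
S = np.cov(Y, rowvar=False)

def t2_contrast(C):
    """T^2 = n (C ybar)' (C S C')^{-1} (C ybar), as in (6.90)."""
    z = C @ ybar
    return n * z @ np.linalg.solve(C @ S @ C.T, z)

C1 = np.array([[1, -1, 0, 0], [0, 1, -1, 0], [0, 0, 1, -1]], dtype=float)
C2 = np.array([[1, 0, 0, -1], [0, 1, 0, -1], [0, 0, 1, -1]], dtype=float)

# Row operations taking C2 to C1: row1 - row2, row2 - row3, row3 unchanged.
A = np.array([[1, -1, 0], [0, 1, -1], [0, 0, 1]], dtype=float)

# Quadratic form that does not involve any contrast matrix.
j = np.ones(p)
Sinv_y = np.linalg.solve(S, ybar)
Sinv_j = np.linalg.solve(S, j)
lrt_form = ybar @ Sinv_y - (ybar @ Sinv_j) ** 2 / (j @ Sinv_j)

print(t2_contrast(C1), t2_contrast(C2), n * lrt_form)
```

All three printed values agree, which is the numerical counterpart of the statement that any full-rank $\mathbf{C}$ with $\mathbf{C}\mathbf{j} = \mathbf{0}$ gives the same test.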


Example 6.9.2. The data in Table 6.12 were given by Cochran and Cox (1957, p. 130). As rearranged by Timm (1980), the observations constitute a one-sample repeated measures design with two within-subjects factors. Factor A is a comparison of two tasks; factor B is a comparison of two types of calculators. The measurements are speed of calculation.

Table 6.12. Calculator Speed Data

                A_1             A_2
Subjects     B_1    B_2      B_1    B_2
S_1          30     21       21     14
S_2          22     13       22      5
S_3          29     13       18     17
S_4          12      7       16     14
S_5          23     24       23      8

To test the hypothesis $H_0\colon \mu_1 = \mu_2 = \mu_3 = \mu_4$, we use the contrast matrix

\[
\mathbf{C} = \begin{pmatrix} 1 & 1 & -1 & -1 \\ 1 & -1 & 1 & -1 \\ 1 & -1 & -1 & 1 \end{pmatrix},
\]

where the first row compares the two levels of A, the second row compares the two levels of B, and the third row corresponds to the AB interaction. From the five observation vectors in Table 6.12, we obtain

\[
\bar{\mathbf{y}} = \begin{pmatrix} 23.2 \\ 15.6 \\ 20.0 \\ 11.6 \end{pmatrix}, \qquad
\mathbf{S} = \begin{pmatrix}
51.7 & 29.8 & 9.2 & 7.4 \\
29.8 & 46.8 & 16.2 & -8.7 \\
9.2 & 16.2 & 8.5 & -10.5 \\
7.4 & -8.7 & -10.5 & 24.3
\end{pmatrix}.
\]

For the overall test of equality of means, we have, by (6.90),

\[
T^2 = n(\mathbf{C}\bar{\mathbf{y}})'(\mathbf{C}\mathbf{S}\mathbf{C}')^{-1}(\mathbf{C}\bar{\mathbf{y}}) = 29.736 < T^2_{.05,3,4} = 114.986.
\]

Since the $T^2$-test is not significant, we would ordinarily not proceed with tests based on the individual rows of $\mathbf{C}$. We will do so, however, for illustrative purposes. (Note that the $T^2$-test has very low power in this case, because $n - 1 = 4$ is very small.)

To test A, B, and AB, we test each row of $\mathbf{C}$, where $T_i^2 = n(\mathbf{c}_i'\bar{\mathbf{y}})'(\mathbf{c}_i'\mathbf{S}\mathbf{c}_i)^{-1}\mathbf{c}_i'\bar{\mathbf{y}}$ is the square of the t-statistic

\[
t_i = \frac{\sqrt{n}\,\mathbf{c}_i'\bar{\mathbf{y}}}{\sqrt{\mathbf{c}_i'\mathbf{S}\mathbf{c}_i}}, \qquad i = 1, 2, 3,
\]

where $\mathbf{c}_i'$ is the $i$th row of $\mathbf{C}$.

The three results are as follows:

Factor A:         $t_1 = 1.459 < t_{.025,4} = 2.776$,
Factor B:         $t_2 = 5.247 > t_{.005,4} = 4.604$,
Interaction AB:   $t_3 = -.152$.

Thus only the main effect for B is significant. Note that in all but one case in Table 6.12, the value for $B_1$ is greater than that for $B_2$. □
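The calculations in Example 6.9.2 can be reproduced directly from the five observation vectors in Table 6.12. The following is a minimal sketch, assuming NumPy; the column order A1B1, A1B2, A2B1, A2B2 is taken from the table, and the variable names are illustrative.

```python
import numpy as np

# Calculator speed data (Table 6.12); columns are A1B1, A1B2, A2B1, A2B2.
Y = np.array([
    [30, 21, 21, 14],
    [22, 13, 22, 5],
    [29, 13, 18, 17],
    [12, 7, 16, 14],
    [23, 24, 23, 8],
], dtype=float)
n = Y.shape[0]

# Rows: A main effect, B main effect, AB interaction.
C = np.array([
    [1, 1, -1, -1],
    [1, -1, 1, -1],
    [1, -1, -1, 1],
], dtype=float)

ybar = Y.mean(axis=0)
S = np.cov(Y, rowvar=False)          # sample covariance with divisor n - 1

# Overall test (6.90).
z = C @ ybar
CSC = C @ S @ C.T
T2 = n * z @ np.linalg.solve(CSC, z)

# Row-by-row t-statistics: sqrt(n) c_i' ybar / sqrt(c_i' S c_i).
t = np.sqrt(n) * z / np.sqrt(np.diag(CSC))

print(np.round(ybar, 1), round(T2, 3), np.round(t, 3))
```

The overall statistic and the three t-values agree with those reported in the example; the critical values are taken from tables and are not computed here.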


6.9.3 k-Sample Repeated Measures Model

We turn now to the k-sample repeated measures design depicted in Table 6.9. As noted in Section 6.9.1, the multivariate approach to this repeated measures design uses the one-way MANOVA model $\mathbf{y}_{ij} = \boldsymbol{\mu} + \boldsymbol{\beta}_i + \boldsymbol{\varepsilon}_{ij} = \boldsymbol{\mu}_i + \boldsymbol{\varepsilon}_{ij}$. From the $k$ groups of $n$ observation vectors each, we calculate $\bar{\mathbf{y}}_{1.}, \bar{\mathbf{y}}_{2.}, \ldots, \bar{\mathbf{y}}_{k.}$ and the error matrix $\mathbf{E}$.

The layout in Table 6.9 is similar to that of a k-sample profile analysis in Section 6.8. To test (the within-subjects) factor A, we need to compare the means of the variables $y_1, y_2, \ldots, y_p$ within $\mathbf{y}$ averaged across the levels of factor B. The $p$ variables correspond to the levels of factor A. In the model $\mathbf{y}_{ij} = \boldsymbol{\mu}_i + \boldsymbol{\varepsilon}_{ij}$, the mean vectors $\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \ldots, \boldsymbol{\mu}_k$ correspond to the levels of factor B and are estimated by $\bar{\mathbf{y}}_{1.}, \bar{\mathbf{y}}_{2.}, \ldots, \bar{\mathbf{y}}_{k.}$. To compare the means of $y_1, y_2, \ldots, y_p$ averaged across the levels of B, we use $\bar{\boldsymbol{\mu}}_{.} = \sum_{i=1}^{k} \boldsymbol{\mu}_i/k$, which is estimated by $\bar{\mathbf{y}}_{..} = \sum_{i=1}^{k} \bar{\mathbf{y}}_{i.}/k$. The hypothesis $H_0\colon \bar{\mu}_{.1} = \bar{\mu}_{.2} = \cdots = \bar{\mu}_{.p}$ comparing the means of $y_1, y_2, \ldots, y_p$ (for factor A) can be expressed using contrasts:

\[
H_0\colon \mathbf{C}\bar{\boldsymbol{\mu}}_{.} = \mathbf{0}, \tag{6.92}
\]

where $\mathbf{C}$ is any $(p - 1) \times p$ full-rank contrast matrix with $\mathbf{C}\mathbf{j} = \mathbf{0}$. This is equivalent to the “flatness” test of profile analysis, the third test in Section 6.8. Under $H_0$, the vector $\mathbf{C}\bar{\mathbf{y}}_{..}$ is distributed as $N_{p-1}(\mathbf{0}, \mathbf{C}\boldsymbol{\Sigma}\mathbf{C}'/N)$, where $N = \sum_i n_i$ for an unbalanced design and $N = kn$ in the balanced case. We can, therefore, test $H_0$ with

\[
T^2 = N(\mathbf{C}\bar{\mathbf{y}}_{..})'(\mathbf{C}\mathbf{S}_{pl}\mathbf{C}')^{-1}(\mathbf{C}\bar{\mathbf{y}}_{..}), \tag{6.93}
\]

where $\mathbf{S}_{pl} = \mathbf{E}/\nu_E$. The $T^2$-statistic in (6.93) is distributed as $T^2_{p-1,\nu_E}$ when $H_0$ is true, where $\nu_E = N - k$ [see (6.84) and the comments following]. Note that the dimension of $T^2$ is $p - 1$ because $\mathbf{C}\bar{\mathbf{y}}_{..}$ is $(p - 1) \times 1$.

For the grouping or between-subjects factor B, we wish to compare the means for the $k$ levels of B. The mean response for the $i$th level of B (averaged over the levels of A) is $\sum_{r=1}^{p} \mu_{ir}/p = \mathbf{j}'\boldsymbol{\mu}_i/p$. The hypothesis can be expressed as

\[
H_0\colon \mathbf{j}'\boldsymbol{\mu}_1 = \mathbf{j}'\boldsymbol{\mu}_2 = \cdots = \mathbf{j}'\boldsymbol{\mu}_k, \tag{6.94}
\]

which is analogous to (6.80), the “levels” hypothesis in profile analysis. This is easily tested by calculating a univariate F-statistic for a one-way ANOVA on $z_{ij} = \mathbf{j}'\mathbf{y}_{ij}$, $i = 1, 2, \ldots, k$; $j = 1, 2, \ldots, n_i$. There is a $z_{ij}$ corresponding to each subject, $S_{ij}$. The observation vector for each subject is thus reduced to a single scalar observation, and we have a one-way ANOVA comparing the means $\mathbf{j}'\bar{\mathbf{y}}_{1.}, \mathbf{j}'\bar{\mathbf{y}}_{2.}, \ldots, \mathbf{j}'\bar{\mathbf{y}}_{k.}$. (Note that $\mathbf{j}'\bar{\mathbf{y}}_{i.}/p$ is an average over the $p$ levels of A.)

The AB interaction hypothesis is equivalent to the parallelism hypothesis in profile analysis [see (6.78)],

\[
H_0\colon \mathbf{C}\boldsymbol{\mu}_1 = \mathbf{C}\boldsymbol{\mu}_2 = \cdots = \mathbf{C}\boldsymbol{\mu}_k. \tag{6.95}
\]

In other words, differences or contrasts among the levels of factor A are the same across all levels of factor B. This is easily tested by performing a one-way MANOVA on $\mathbf{z}_{ij} = \mathbf{C}\mathbf{y}_{ij}$ or directly by

\[
\Lambda = \frac{|\mathbf{C}\mathbf{E}\mathbf{C}'|}{|\mathbf{C}(\mathbf{E} + \mathbf{H})\mathbf{C}'|} \tag{6.96}
\]

[see (6.78)], which is distributed as $\Lambda_{p-1,\nu_H,\nu_E}$, with $\nu_H = k - 1$ and $\nu_E = N - k$; that is, $\nu_E = \sum_i (n_i - 1)$ for the unbalanced model or $\nu_E = k(n - 1)$ in the balanced model.


6.9.4  Computation  of  Repeated  Measures  Tests 

Some statistical software packages have automated repeated measures procedures that are easily implemented. However, if one is unsure as to how the resulting tests correspond to the tests in Section 6.9.3, there are two ways to obtain the tests directly. One approach is to calculate (6.93) and (6.96) outright using a matrix manipulation routine. We would need to have available the $\mathbf{E}$ and $\mathbf{H}$ matrices of a one-way MANOVA using a data layout as in Table 6.9.

The second approach uses simple data transformations available in virtually all programs. To test (6.92) for factor A, we would transform each $\mathbf{y}_{ij}$ to $\mathbf{z}_{ij} = \mathbf{C}\mathbf{y}_{ij}$ by using the rows of $\mathbf{C}$. For example, if

\[
\mathbf{C} = \begin{pmatrix} 1 & -1 & 0 & 0 \\ 0 & 1 & -1 & 0 \\ 0 & 0 & 1 & -1 \end{pmatrix},
\]

then each $\mathbf{y}' = (y_1, y_2, y_3, y_4)$ becomes $\mathbf{z}' = (y_1 - y_2, y_2 - y_3, y_3 - y_4)$. We then test $H_0\colon \boldsymbol{\mu}_z = \mathbf{0}$ using a one-sample $T^2$ on all $N$ of the $\mathbf{z}_{ij}$'s,

\[
T^2 = N\bar{\mathbf{z}}'\mathbf{S}_z^{-1}\bar{\mathbf{z}},
\]

where $N = \sum_i n_i$, $\bar{\mathbf{z}} = \sum_{ij} \mathbf{z}_{ij}/N$, and $\mathbf{S}_z = \mathbf{E}_z/\nu_E$ is the pooled covariance matrix. Reject $H_0$ if $T^2 > T^2_{\alpha,p-1,\nu_E}$.

To test (6.94) for factor B, we sum the components of each observation vector to obtain $z_{ij} = \mathbf{j}'\mathbf{y}_{ij} = y_{ij1} + y_{ij2} + \cdots + y_{ijp}$ and compare the means $\bar{z}_{1.}, \bar{z}_{2.}, \ldots, \bar{z}_{k.}$ by an F-test, as in one-way ANOVA.

To test the interaction hypothesis (6.95), we transform each $\mathbf{y}_{ij}$ to $\mathbf{z}_{ij} = \mathbf{C}\mathbf{y}_{ij}$ using the rows of $\mathbf{C}$, as before. Note that $\mathbf{z}_{ij}$ is $(p - 1) \times 1$. We then do a one-way MANOVA on $\mathbf{z}_{ij}$ to obtain

\[
\Lambda = \frac{|\mathbf{E}_z|}{|\mathbf{E}_z + \mathbf{H}_z|}. \tag{6.97}
\]
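The equivalence of the two computational approaches can be illustrated with a short script. The sketch below assumes NumPy and a balanced design with simulated data (group sizes, means, and seed are arbitrary); the helper `one_way_EH` is a hypothetical name for a routine that returns the error and hypothesis matrices of a one-way MANOVA.

```python
import numpy as np

def one_way_EH(groups):
    """E and H of a one-way MANOVA; groups is a list of (n_i x q) arrays."""
    allobs = np.vstack(groups)
    grand = allobs.mean(axis=0)
    q = allobs.shape[1]
    E, H = np.zeros((q, q)), np.zeros((q, q))
    for g in groups:
        m = g.mean(axis=0)
        d = g - m
        E += d.T @ d                                  # within-group SSCP
        H += len(g) * np.outer(m - grand, m - grand)  # between-group SSCP
    return E, H

rng = np.random.default_rng(1)
k, n, p = 3, 8, 4
groups = [rng.normal(loc=i, size=(n, p)) for i in range(k)]  # simulated data
N, nu_E, nu_H = k * n, k * (n - 1), k - 1
C = np.eye(p - 1, p) - np.eye(p - 1, p, k=1)   # successive differences

# Approach 1: work directly with E, H, and the grand mean of the y's.
E, H = one_way_EH(groups)
ybar = np.vstack(groups).mean(axis=0)
Cy = C @ ybar
T2_A = N * Cy @ np.linalg.solve(C @ (E / nu_E) @ C.T, Cy)                    # (6.93)
lam_BA = np.linalg.det(C @ E @ C.T) / np.linalg.det(C @ (E + H) @ C.T)       # (6.96)
j = np.ones(p)
F_B = ((j @ H @ j) / nu_H) / ((j @ E @ j) / nu_E)

# Approach 2: transform the data first, then use standard procedures.
Ez, Hz = one_way_EH([g @ C.T for g in groups])          # z_ij = C y_ij
zbar = np.vstack([g @ C.T for g in groups]).mean(axis=0)
T2_A_z = N * zbar @ np.linalg.solve(Ez / nu_E, zbar)
lam_BA_z = np.linalg.det(Ez) / np.linalg.det(Ez + Hz)   # (6.97)
Es, Hs = one_way_EH([g.sum(axis=1, keepdims=True) for g in groups])  # z_ij = j'y_ij
F_B_z = (Hs[0, 0] / nu_H) / (Es[0, 0] / nu_E)

print(T2_A, T2_A_z, lam_BA, lam_BA_z, F_B, F_B_z)
```

Each pair of values agrees, so either route may be used, depending on whether a matrix routine or only a data-transformation step is available.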


6.9.5 Repeated Measures with Two Within-Subjects Factors and One
Between-Subjects Factor

The repeated measures model with two within-subjects factors A and B and one between-subjects factor C corresponds to a one-way MANOVA design in which each observation vector includes measurements on a two-way factorial arrangement of treatments. Thus each subject receives all treatment combinations of the two factors A and B. As usual, the sequence of administration of treatment combinations should be randomized for each subject. A design of this type is illustrated in Table 6.13.

Table 6.13. Data Layout for Repeated Measures with Two Within-Subjects Factors and One Between-Subjects Factor

Between-                         Within-Subjects Factors
Subjects                   A_1               A_2               A_3
Factor     Subjects    B_1  B_2  B_3     B_1  B_2  B_3     B_1  B_2  B_3
C_1        S_11       (y_111 y_112 y_113 y_114 y_115 y_116 y_117 y_118 y_119) = y'_11
           S_12        y'_12
           ...
           S_1n_1      y'_1n_1
C_2        S_21        y'_21
           S_22        y'_22
           ...
           S_2n_2      y'_2n_2
C_3        S_31        y'_31
           S_32        y'_32
           ...
           S_3n_3      y'_3n_3

Each $\mathbf{y}_{ij}$ in Table 6.13 has nine elements, consisting of responses to the nine treatment combinations $A_1B_1, A_1B_2, \ldots, A_3B_3$. We are interested in the same hypotheses as in a univariate split-plot design, but we use a multivariate approach to allow for correlated $y$'s. The model for the observation vectors is the one-way MANOVA model

\[
\mathbf{y}_{ij} = \boldsymbol{\mu} + \boldsymbol{\gamma}_i + \boldsymbol{\varepsilon}_{ij} = \boldsymbol{\mu}_i + \boldsymbol{\varepsilon}_{ij},
\]

where $\boldsymbol{\gamma}_i$ is the C effect.

To test factors A, B, and AB in Table 6.13, we use contrasts in the $y$'s. As an example of contrast matrices, consider

\[
\mathbf{A} = \begin{pmatrix}
2 & 2 & 2 & -1 & -1 & -1 & -1 & -1 & -1 \\
0 & 0 & 0 & 1 & 1 & 1 & -1 & -1 & -1
\end{pmatrix}, \tag{6.98}
\]

\[
\mathbf{B} = \begin{pmatrix}
2 & -1 & -1 & 2 & -1 & -1 & 2 & -1 & -1 \\
0 & 1 & -1 & 0 & 1 & -1 & 0 & 1 & -1
\end{pmatrix}, \tag{6.99}
\]

\[
\mathbf{G} = \begin{pmatrix}
4 & -2 & -2 & -2 & 1 & 1 & -2 & 1 & 1 \\
0 & 2 & -2 & 0 & -1 & 1 & 0 & -1 & 1 \\
0 & 0 & 0 & 2 & -1 & -1 & -2 & 1 & 1 \\
0 & 0 & 0 & 0 & 1 & -1 & 0 & -1 & 1
\end{pmatrix}. \tag{6.100}
\]

The rows of $\mathbf{A}$ are orthogonal contrasts with two comparisons:

$A_1$ vs. $A_2$ and $A_3$,
$A_2$ vs. $A_3$.

Similarly, $\mathbf{B}$ compares

$B_1$ vs. $B_2$ and $B_3$,
$B_2$ vs. $B_3$.

Other orthogonal (or linearly independent) contrasts could be used for $\mathbf{A}$ and $\mathbf{B}$. The matrix $\mathbf{G}$ is for the AB interaction and is obtained from products of the corresponding elements of the rows of $\mathbf{A}$ and the rows of $\mathbf{B}$.
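The construction of the interaction contrasts from elementwise products of rows can be written compactly with Kronecker products. The sketch below assumes NumPy and assumes the nine responses are ordered $A_1B_1, A_1B_2, \ldots, A_3B_3$ as in Table 6.13; the names `a_contr` and `b_contr` are illustrative labels for the one-factor contrasts (first level vs. the other two, second vs. third).

```python
import numpy as np

# Contrasts among the three levels of a single factor.
a_contr = np.array([[2, -1, -1], [0, 1, -1]])
b_contr = np.array([[2, -1, -1], [0, 1, -1]])
ones3 = np.ones((1, 3), dtype=int)

# B varies fastest within each level of A, so A repeats each entry three
# times and B repeats its pattern three times.
A = np.kron(a_contr, ones3)      # 2 x 9 main-effect contrasts for A
B = np.kron(ones3, b_contr)      # 2 x 9 main-effect contrasts for B
G = np.kron(a_contr, b_contr)    # 4 x 9 interaction contrasts

# The same G, built row by row as elementwise products of rows of A and B.
G_rows = np.array([A[i] * B[j] for i in range(2) for j in range(2)])

print(A, B, G, sep="\n\n")
```

With a levels of A and b levels of B the same three calls give matrices with a - 1, b - 1, and (a - 1)(b - 1) rows, provided the corresponding one-factor contrast matrices and a vector of ones of the right length are supplied.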

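As before, we define $\bar{\mathbf{y}}_{..} = \sum_{ij} \mathbf{y}_{ij}/N$, $\mathbf{S}_{pl} = \mathbf{E}/\nu_E$, and $N = \sum_i n_i$. If there were $k$ levels of C in Table 6.13 with mean vectors $\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \ldots, \boldsymbol{\mu}_k$, then $\bar{\boldsymbol{\mu}}_{.} = \sum_{i=1}^{k} \boldsymbol{\mu}_i/k$, and the A main effect corresponding to $H_0\colon \mathbf{A}\bar{\boldsymbol{\mu}}_{.} = \mathbf{0}$ could be tested with

\[
T^2 = N(\mathbf{A}\bar{\mathbf{y}}_{..})'(\mathbf{A}\mathbf{S}_{pl}\mathbf{A}')^{-1}(\mathbf{A}\bar{\mathbf{y}}_{..}), \tag{6.101}
\]

which is distributed as $T^2_{2,\nu_E}$ under $H_0$, where $\nu_E = \sum_{i=1}^{k} (n_i - 1)$. The dimension is 2, corresponding to the two rows of $\mathbf{A}$.

Similarly, to test $H_0\colon \mathbf{B}\bar{\boldsymbol{\mu}}_{.} = \mathbf{0}$ and $H_0\colon \mathbf{G}\bar{\boldsymbol{\mu}}_{.} = \mathbf{0}$ for the B main effect and the AB interaction, respectively, we have

\[
T^2 = N(\mathbf{B}\bar{\mathbf{y}}_{..})'(\mathbf{B}\mathbf{S}_{pl}\mathbf{B}')^{-1}(\mathbf{B}\bar{\mathbf{y}}_{..}), \tag{6.102}
\]

\[
T^2 = N(\mathbf{G}\bar{\mathbf{y}}_{..})'(\mathbf{G}\mathbf{S}_{pl}\mathbf{G}')^{-1}(\mathbf{G}\bar{\mathbf{y}}_{..}), \tag{6.103}
\]

which are distributed as $T^2_{2,\nu_E}$ and $T^2_{4,\nu_E}$, respectively. In general, if factor A has $a$ levels and factor B has $b$ levels, then $\mathbf{A}$ has $a - 1$ rows, $\mathbf{B}$ has $b - 1$ rows, and $\mathbf{G}$ has $(a - 1)(b - 1)$ rows. The $T^2$-statistics are then distributed as $T^2_{a-1,\nu_E}$, $T^2_{b-1,\nu_E}$, and $T^2_{(a-1)(b-1),\nu_E}$, respectively.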

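Factors A, B, and AB can be tested with Wilks' $\Lambda$ as well as $T^2$. Define $\mathbf{H}^* = N\bar{\mathbf{y}}_{..}\bar{\mathbf{y}}_{..}'$ from the partitioning $\sum_{ij} \mathbf{y}_{ij}\mathbf{y}_{ij}' = \mathbf{E} + \mathbf{H} + N\bar{\mathbf{y}}_{..}\bar{\mathbf{y}}_{..}'$. This can be used to test $H_0\colon \bar{\boldsymbol{\mu}}_{.} = \mathbf{0}$ (not usually a hypothesis of interest) by means of

\[
\Lambda = \frac{|\mathbf{E}|}{|\mathbf{E} + \mathbf{H}^*|}, \tag{6.104}
\]

which is $\Lambda_{p,1,\nu_E}$ if $H_0$ is true. Then the hypothesis of interest, $H_0\colon \mathbf{A}\bar{\boldsymbol{\mu}}_{.} = \mathbf{0}$ for factor A, can be tested with

\[
\Lambda = \frac{|\mathbf{A}\mathbf{E}\mathbf{A}'|}{|\mathbf{A}(\mathbf{E} + \mathbf{H}^*)\mathbf{A}'|}, \tag{6.105}
\]

which is distributed as $\Lambda_{a-1,1,\nu_E}$ when $H_0$ is true, where $a$ is the number of levels of factor A. There are similar expressions for testing factors B and AB. Note that the dimension of $\Lambda$ in (6.105) is $a - 1$, because $\mathbf{A}\mathbf{E}\mathbf{A}'$ is $(a - 1) \times (a - 1)$.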

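The $T^2$ and Wilks' $\Lambda$ expressions in (6.101) and (6.105) are related by

\[
\Lambda = \frac{\nu_E}{\nu_E + T^2}, \tag{6.106}
\]

\[
T^2 = \nu_E\,\frac{1 - \Lambda}{\Lambda}. \tag{6.107}
\]

We can establish (6.106) as follows. By (6.28),

\[
\begin{aligned}
U^{(s)} = \sum_{i=1}^{s} \lambda_i &= \operatorname{tr}[(\mathbf{A}\mathbf{E}\mathbf{A}')^{-1}(\mathbf{A}\mathbf{H}^*\mathbf{A}')] \\
&= \operatorname{tr}[(\mathbf{A}\mathbf{E}\mathbf{A}')^{-1}\mathbf{A}N\bar{\mathbf{y}}_{..}\bar{\mathbf{y}}_{..}'\mathbf{A}'] \\
&= N\operatorname{tr}[(\mathbf{A}\mathbf{E}\mathbf{A}')^{-1}\mathbf{A}\bar{\mathbf{y}}_{..}(\mathbf{A}\bar{\mathbf{y}}_{..})'] \\
&= N\operatorname{tr}[(\mathbf{A}\bar{\mathbf{y}}_{..})'(\mathbf{A}\mathbf{E}\mathbf{A}')^{-1}\mathbf{A}\bar{\mathbf{y}}_{..}] \\
&= \frac{N}{\nu_E}(\mathbf{A}\bar{\mathbf{y}}_{..})'(\mathbf{A}\mathbf{S}_{pl}\mathbf{A}')^{-1}\mathbf{A}\bar{\mathbf{y}}_{..} \\
&= \frac{T^2}{\nu_E}.
\end{aligned}
\]

Since rank$(\mathbf{H}^*) = 1$, only $\lambda_1$ is nonzero, and

\[
U^{(1)} = \sum_{i=1}^{s} \lambda_i = \lambda_1.
\]

By (6.14),

\[
\Lambda = \frac{1}{1 + \lambda_1} = \frac{1}{1 + U^{(1)}} = \frac{1}{1 + T^2/\nu_E},
\]

which is the same as (6.106).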
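The relationship between $\Lambda$ and $T^2$ in (6.106) and (6.107) can be confirmed numerically. The sketch below assumes NumPy and uses simulated nine-element observation vectors in two groups (all sizes, means, and the seed are arbitrary illustrative choices), with the A contrasts built for three levels of each within-subjects factor.

```python
import numpy as np

rng = np.random.default_rng(2)
k, n, p = 2, 10, 9
groups = [rng.normal(loc=10 + i, size=(n, p)) for i in range(k)]  # simulated data
Y = np.vstack(groups)
N = k * n
nu_E = N - k

ybar = Y.mean(axis=0)                                   # overall mean vector
E = sum((g - g.mean(axis=0)).T @ (g - g.mean(axis=0)) for g in groups)
H_star = N * np.outer(ybar, ybar)                       # H* = N ybar ybar'

# A main-effect contrasts (first level vs. others, second vs. third).
A = np.kron(np.array([[2.0, -1.0, -1.0], [0.0, 1.0, -1.0]]), np.ones((1, 3)))

Ay = A @ ybar
T2 = N * Ay @ np.linalg.solve(A @ (E / nu_E) @ A.T, Ay)                       # (6.101)
lam = np.linalg.det(A @ E @ A.T) / np.linalg.det(A @ (E + H_star) @ A.T)      # (6.105)

# Largest eigenvalue of (AEA')^{-1}(AH*A'); the other one is zero.
lam1 = np.linalg.eigvals(np.linalg.solve(A @ E @ A.T, A @ H_star @ A.T)).real.max()

print(lam, nu_E / (nu_E + T2), T2, nu_E * (1 - lam) / lam, lam1, T2 / nu_E)
```

The printed pairs agree, matching the steps of the derivation: the single nonzero eigenvalue equals $T^2/\nu_E$, and $\Lambda = 1/(1 + \lambda_1)$.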

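Factor C is tested exactly as factor B in Section 6.9.3. The hypothesis is

\[
H_0\colon \mathbf{j}'\boldsymbol{\mu}_1 = \mathbf{j}'\boldsymbol{\mu}_2 = \cdots = \mathbf{j}'\boldsymbol{\mu}_k,
\]

as in (6.94), and we perform a univariate F-test on $z_{ij} = \mathbf{j}'\mathbf{y}_{ij}$ in a one-way ANOVA layout.

The AC, BC, and ABC interactions are tested as follows.

AC Interaction

The AC interaction hypothesis is

\[
H_0\colon \mathbf{A}\boldsymbol{\mu}_1 = \mathbf{A}\boldsymbol{\mu}_2 = \cdots = \mathbf{A}\boldsymbol{\mu}_k,
\]

which states that contrasts in factor A are the same across all levels of factor C. This can be tested by

\[
\Lambda = \frac{|\mathbf{A}\mathbf{E}\mathbf{A}'|}{|\mathbf{A}(\mathbf{E} + \mathbf{H})\mathbf{A}'|}, \tag{6.108}
\]

which is distributed as $\Lambda_{2,\nu_H,\nu_E}$, where $a - 1 = 2$ is the number of rows of $\mathbf{A}$ and $\nu_H$ and $\nu_E$ are from the multivariate one-way model. Alternatively, the test can be carried out by transforming $\mathbf{y}_{ij}$ to $\mathbf{z}_{ij} = \mathbf{A}\mathbf{y}_{ij}$ and doing a one-way MANOVA on $\mathbf{z}_{ij}$.

BC Interaction

The BC interaction hypothesis,

\[
H_0\colon \mathbf{B}\boldsymbol{\mu}_1 = \mathbf{B}\boldsymbol{\mu}_2 = \cdots = \mathbf{B}\boldsymbol{\mu}_k,
\]

is tested by

\[
\Lambda = \frac{|\mathbf{B}\mathbf{E}\mathbf{B}'|}{|\mathbf{B}(\mathbf{E} + \mathbf{H})\mathbf{B}'|},
\]

which is $\Lambda_{2,\nu_H,\nu_E}$, where $b - 1 = 2$; $H_0$ can also be tested by doing MANOVA on $\mathbf{z}_{ij} = \mathbf{B}\mathbf{y}_{ij}$.

ABC Interaction

The ABC interaction hypothesis,

\[
H_0\colon \mathbf{G}\boldsymbol{\mu}_1 = \mathbf{G}\boldsymbol{\mu}_2 = \cdots = \mathbf{G}\boldsymbol{\mu}_k,
\]

is tested by

\[
\Lambda = \frac{|\mathbf{G}\mathbf{E}\mathbf{G}'|}{|\mathbf{G}(\mathbf{E} + \mathbf{H})\mathbf{G}'|},
\]

which is $\Lambda_{4,\nu_H,\nu_E}$, or by doing MANOVA on $\mathbf{z}_{ij} = \mathbf{G}\mathbf{y}_{ij}$. In this case the dimension is $(a - 1)(b - 1) = 4$.

The preceding tests for AC, BC, or ABC can also be carried out with the other three MANOVA test statistics using eigenvalues of the appropriate matrices. For example, for AC we would use $(\mathbf{A}\mathbf{E}\mathbf{A}')^{-1}(\mathbf{A}\mathbf{H}\mathbf{A}')$.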


Example 6.9.5. The data in Table 6.14 represent a repeated measures design with two within-subjects factors and one between-subjects factor (Timm 1980). Since A and B have three levels each, as in the illustration in this section, we will use the $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{G}$ matrices in (6.98), (6.99), and (6.100). The $\mathbf{E}$ and $\mathbf{H}$ matrices are $9 \times 9$ and will not be shown. The overall mean vector is given by

\[
\bar{\mathbf{y}}_{..}' = (46.45, 39.25, 31.70, 38.85, 45.40, 40.15, 34.55, 36.90, 39.15).
\]

Table 6.14. Data from a Repeated Measures Experiment with Two Within-Subjects Factors and One Between-Subjects Factor

Between-                      Within-Subjects Factors
Subjects                  A_1             A_2             A_3
Factor     Subjects   B_1  B_2  B_3   B_1  B_2  B_3   B_1  B_2  B_3
C_1        S_11       20   21   21    32   42   37    32   32   32
           S_12       67   48   29    43   56   48    39   40   41
           S_13       37   31   25    27   28   30    31   33   34
           S_14       42   40   38    37   36   28    19   27   35
           S_15       57   45   32    27   21   25    30   29   29
           S_16       39   39   38    46   54   43    31   29   28
           S_17       43   32   20    33   46   44    42   37   31
           S_18       35   34   34    39   43   39    35   39   42
           S_19       41   32   23    37   51   39    27   28   30
           S_1,10     39   32   24    30   35   31    26   29   32
C_2        S_21       47   36   25    31   36   29    21   24   27
           S_22       53   43   32    40   48   47    46   50   54
           S_23       38   35   33    38   42   45    48   48   49
           S_24       60   51   41    54   67   60    53   52   50
           S_25       37   36   35    40   45   40    34   40   46
           S_26       59   48   37    45   52   44    36   44   52
           S_27       67   50   33    47   61   46    31   41   50
           S_28       43   35   27    32   36   35    33   33   32
           S_29       64   59   53    58   62   51    40   42   43
           S_2,10     41   38   34    41   47   42    37   41   46

By (6.101), the test for factor A is

\[
\begin{aligned}
T^2 &= N(\mathbf{A}\bar{\mathbf{y}}_{..})'(\mathbf{A}\mathbf{S}_{pl}\mathbf{A}')^{-1}(\mathbf{A}\bar{\mathbf{y}}_{..}) \\
&= 20(-.20, 13.80)
\begin{pmatrix} 2138.4 & 138.6 \\ 138.6 & 450.4 \end{pmatrix}^{-1}
\begin{pmatrix} -.20 \\ 13.80 \end{pmatrix} \\
&= 8.645 > T^2_{.05,2,18} = 7.606.
\end{aligned}
\]

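For factor B, we use (6.102) to obtain

\[
T^2 = N(\mathbf{B}\bar{\mathbf{y}}_{..})'(\mathbf{B}\mathbf{S}_{\mathrm{pl}}\mathbf{B}')^{-1}(\mathbf{B}\bar{\mathbf{y}}_{..})
    = 20\,(7.15,\ 10.55)
      \begin{pmatrix} 305.7 & 94.0 \\ 94.0 & 69.8 \end{pmatrix}^{-1}
      \begin{pmatrix} 7.15 \\ 10.55 \end{pmatrix}
    = 37.438 > T^2_{.01,2,18} = 12.943.
\]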

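By (6.103), the test for the AB interaction is given by

\[
T^2 = N(\mathbf{G}\bar{\mathbf{y}}_{..})'(\mathbf{G}\mathbf{S}_{\mathrm{pl}}\mathbf{G}')^{-1}(\mathbf{G}\bar{\mathbf{y}}_{..})
    = 61.825 > T^2_{.01,4,18} = 23.487.
\]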

To test factor C, we carry out a one-way ANOVA on $z_{ij} = \mathbf{j}'\mathbf{y}_{ij}/9$:

Source     Sum of Squares    df    Mean Square    F

Between    3042.22            1    3042.22        8.54
Error      6408.98           18     356.05


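The observed F, 8.54, has a p-value of .0091 and is therefore significant.
The AC interaction is tested by (6.108) as

\[
\Lambda = \frac{|\mathbf{A}\mathbf{E}\mathbf{A}'|}{|\mathbf{A}(\mathbf{E}+\mathbf{H})\mathbf{A}'|}
        = \frac{3.058 \times 10^{8}}{3.092 \times 10^{8}}
        = .9889 > \Lambda_{.05,2,1,18} = .703.
\]

For the BC interaction, we have

\[
\Lambda = \frac{|\mathbf{B}\mathbf{E}\mathbf{B}'|}{|\mathbf{B}(\mathbf{E}+\mathbf{H})\mathbf{B}'|}
        = \frac{4.053 \times 10^{6}}{4.170 \times 10^{6}}
        = .9718 > \Lambda_{.05,2,1,18} = .703.
\]

For ABC, we obtain

\[
\Lambda = \frac{|\mathbf{G}\mathbf{E}\mathbf{G}'|}{|\mathbf{G}(\mathbf{E}+\mathbf{H})\mathbf{G}'|}
        = \frac{2.643 \times 10^{12}}{2.927 \times 10^{12}}
        = .9029 > \Lambda_{.05,4,1,18} = .551.
\]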


In  summary,  factors  A,  B,  and  C  and  the  AB  interaction  are  significant.  □ 
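
The arithmetic of this example can be checked numerically. The following Python sketch is an illustration only. The contrast matrices (6.98) and (6.99) are not reproduced in this section, so the forms of `A` and `B` used here are an assumption, chosen because they reproduce the contrast values $(-.20, 13.80)$ and $(7.15, 10.55)$ printed above. The 2 x 2 matrices are the rounded values of $\mathbf{A}\mathbf{S}_{\mathrm{pl}}\mathbf{A}'$ and $\mathbf{B}\mathbf{S}_{\mathrm{pl}}\mathbf{B}'$ from the example.

```python
import numpy as np

N = 20  # total number of subjects (10 in C1, 10 in C2)
ybar = np.array([46.45, 39.25, 31.70, 38.85, 45.40, 40.15, 34.55, 36.90, 39.15])

# Assumed forms of the contrast matrices in (6.98) and (6.99); the elements of
# ybar are ordered A1B1, A1B2, A1B3, A2B1, ..., A3B3.
A = np.array([[2, 2, 2, -1, -1, -1, -1, -1, -1],
              [0, 0, 0,  1,  1,  1, -1, -1, -1]], dtype=float)
B = np.array([[2, -1, -1, 2, -1, -1, 2, -1, -1],
              [0,  1, -1, 0,  1, -1, 0,  1, -1]], dtype=float)

# Rounded A S_pl A' and B S_pl B' as printed in the example.
ASA = np.array([[2138.4, 138.6], [138.6, 450.4]])
BSB = np.array([[305.7, 94.0], [94.0, 69.8]])

def t2(contrast, cov):
    """T^2 = N (C ybar)' (C S_pl C')^{-1} (C ybar)."""
    v = contrast @ ybar
    return N * v @ np.linalg.solve(cov, v)

T2_A = t2(A, ASA)
T2_B = t2(B, BSB)
F_C = 3042.22 / 356.05  # mean square between / mean square error for factor C

print(A @ ybar, B @ ybar)
print(round(T2_A, 3), round(T2_B, 3), round(F_C, 2))
```

Because the 2 x 2 matrices are rounded, the value obtained for factor B differs slightly in the second decimal place from the 37.438 reported in the text.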


6.9.6  Repeated  Measures  with  Two  Within-Subjects  Factors 
and  Two  Between-Subjects  Factors 

In this section we consider a balanced two-way MANOVA design in which each
observation vector arises from a two-way factorial arrangement of treatments. This
is illustrated in Table 6.15 for a balanced design with three levels of all factors. Each
$\mathbf{y}_{ijk}$ has nine elements, consisting of responses to the nine treatment combinations
$A_1B_1, A_1B_2, \ldots, A_3B_3$ (see Table 6.13).

To test A, B, and AB, we can use the same contrast matrices A, B, and G as
in (6.98)-(6.100). We define a grand mean vector $\bar{\mathbf{y}}_{...} = \sum_{ijk} \mathbf{y}_{ijk}/N$, where N is
the total number of observation vectors; in this illustration, N = 27. In general,
N = cdn, where c and d are the number of levels of factors C and D and n is the
number of replications in each cell (in the illustration, n = 3). The test statistics for
A, B, and AB are as follows, where $\mathbf{S}_{\mathrm{pl}} = \mathbf{E}/\nu_E$ and the E matrix is obtained from
the two-way MANOVA with $\nu_E = cd(n - 1)$ degrees of freedom.
Table 6.15. Data Layout for Repeated Measures with Two Within-Subjects Factors and
Two Between-Subjects Factors

Between-Subjects                     Within-Subjects Factors
Factors                        A1              A2              A3
C      D      Subject      B1  B2  B3      B1  B2  B3      B1  B2  B3

C1     D1     S111                         y'_111
              S112                         y'_112
              S113                         y'_113
       D2     S121                         y'_121
              S122                         y'_122
              S123                         y'_123
       D3     S131                         y'_131
              S132                         y'_132
              S133                         y'_133
C2     D1     S211                         y'_211
              S212                         y'_212
              S213                         y'_213
       D2     S221                         y'_221
              ...                          ...
       D3     ...                          ...
C3     D1     ...                          ...
       D2     ...                          ...
       D3     ...                          ...
              S333                         y'_333

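Factor A

\[ T^2 = N(\mathbf{A}\bar{\mathbf{y}}_{...})'(\mathbf{A}\mathbf{S}_{\mathrm{pl}}\mathbf{A}')^{-1}(\mathbf{A}\bar{\mathbf{y}}_{...}) \]

is distributed as $T^2_{a-1,\nu_E}$.

Factor B

\[ T^2 = N(\mathbf{B}\bar{\mathbf{y}}_{...})'(\mathbf{B}\mathbf{S}_{\mathrm{pl}}\mathbf{B}')^{-1}(\mathbf{B}\bar{\mathbf{y}}_{...}) \]

is distributed as $T^2_{b-1,\nu_E}$.

AB Interaction

\[ T^2 = N(\mathbf{G}\bar{\mathbf{y}}_{...})'(\mathbf{G}\mathbf{S}_{\mathrm{pl}}\mathbf{G}')^{-1}(\mathbf{G}\bar{\mathbf{y}}_{...}) \]

is distributed as $T^2_{(a-1)(b-1),\nu_E}$.
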
To test factors C, D, and CD, we transform to $z_{ijk} = \mathbf{j}'\mathbf{y}_{ijk}$ and carry out univariate
F-tests on a two-way ANOVA design.

To test factors AC, AD, and ACD, we perform a two-way MANOVA on $\mathbf{A}\mathbf{y}_{ijk}$.
Then, the C main effect on $\mathbf{A}\mathbf{y}_{ijk}$ compares the levels of C on $\mathbf{A}\mathbf{y}_{ijk}$, which is an
effective description of the AC interaction. Similarly, the D main effect on $\mathbf{A}\mathbf{y}_{ijk}$
yields the AD interaction, and the CD interaction on $\mathbf{A}\mathbf{y}_{ijk}$ gives the ACD interaction.

To test factors BC, BD, and BCD, we carry out a two-way MANOVA on $\mathbf{B}\mathbf{y}_{ijk}$.
The C main effect on $\mathbf{B}\mathbf{y}_{ijk}$ gives the BC interaction, the D main effect on $\mathbf{B}\mathbf{y}_{ijk}$
yields the BD interaction, and the CD interaction on $\mathbf{B}\mathbf{y}_{ijk}$ corresponds to the BCD
interaction.

Finally, to test factors ABC, ABD, and ABCD, we perform a two-way MANOVA
on $\mathbf{G}\mathbf{y}_{ijk}$. Then the C main effect on $\mathbf{G}\mathbf{y}_{ijk}$ gives the ABC interaction, the D main
effect on $\mathbf{G}\mathbf{y}_{ijk}$ yields the ABD interaction, and the CD interaction on $\mathbf{G}\mathbf{y}_{ijk}$ corresponds
to the ABCD interaction.
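
The procedure of this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a complete analysis: the data are synthetic (drawn at random), the array layout `Y[i, j, k, :]` for $\mathbf{y}_{ijk}$ is a choice made here, and the form of the contrast matrix `A` is assumed, since (6.98) is not reproduced in this section. The sketch computes $T^2$ for factor A and Wilks' $\Lambda$ for the AC interaction as the C main effect in a MANOVA on $\mathbf{A}\mathbf{y}_{ijk}$.

```python
import numpy as np

rng = np.random.default_rng(0)
c, d, n, p = 3, 3, 3, 9  # levels of C and D, replications per cell, length of y_ijk
# Synthetic observation vectors y_ijk (illustration only, not data from the text).
Y = rng.normal(loc=40.0, scale=5.0, size=(c, d, n, p))

N = c * d * n           # total number of observation vectors
nu_E = c * d * (n - 1)  # error degrees of freedom

ybar = Y.reshape(N, p).mean(axis=0)                         # grand mean vector
resid = (Y - Y.mean(axis=2, keepdims=True)).reshape(N, p)   # deviations from cell means
E = resid.T @ resid                                         # error matrix of the two-way MANOVA
S_pl = E / nu_E

# Assumed form of the contrast matrix for factor A.
A = np.array([[2, 2, 2, -1, -1, -1, -1, -1, -1],
              [0, 0, 0,  1,  1,  1, -1, -1, -1]], dtype=float)

def t2_within(M):
    """T^2 = N (M ybar)' (M S_pl M')^{-1} (M ybar) for a within-subjects contrast matrix M."""
    v = M @ ybar
    return N * v @ np.linalg.solve(M @ S_pl @ M.T, v)

T2_A = t2_within(A)

# AC interaction: hypothesis matrix for the C main effect, then Wilks' Lambda on A y_ijk.
c_means = Y.mean(axis=(1, 2))   # mean vector for each level of C
dev = c_means - ybar
H_C = d * n * dev.T @ dev
lam_AC = np.linalg.det(A @ E @ A.T) / np.linalg.det(A @ (E + H_C) @ A.T)

print(T2_A, lam_AC)
```

The same pattern applies to B and G in place of A, and to the D main effect and CD interaction in place of the C main effect.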

6.9.7  Additional  Topics 

Wang  (1983)  and  Timm  (1980)  give  a  method  for  obtaining  univariate  mixed-model 
sums of squares from the multivariate E and H matrices. Crepeau et al. (1985) consider
repeated measures experiments with missing data. Federer (1986) discusses
the  planning  of  repeated  measures  designs,  emphasizing  such  aspects  as  determining 
the  length  of  treatment  period,  eliminating  carry-over  effects,  the  nature  of  pre-  and 
posttreatment,  the  nature  of  a  response  to  a  treatment,  treatment  sequences,  and  the 
choice  of  a  model.  Vonesh  (1986)  discusses  sample  size  requirements  to  achieve  a 
given  power  level  in  repeated  measures  designs.  Patel  (1986)  presents  a  model  that 
accommodates  both  within-  and  between-subjects  covariates  in  repeated  measures 
designs.  Jensen  (1982)  compares  the  efficiency  and  robustness  of  various  procedures. 

A  multivariate  or  multiresponse  repeated  measurement  design  will  result  if  more 
than  one  variable  is  measured  on  each  subject  at  each  treatment  combination.  Such 
designs  are  discussed  by  Timm  (1980),  Reinsel  (1982),  Wang  (1983),  and  Thomas 
(1983).  Bock  (1975)  refers  to  observations  of  this  type  as  doubly  multivariate  data. 

6.10  GROWTH  CURVES 

When  the  subject  responds  to  a  treatment  or  stimulus  at  successive  time  periods, 
the pattern of responses is often referred to as a growth curve. As in repeated measures
experiments, subjects are usually human or animal. We consider estimation
and  testing  hypotheses  about  the  form  of  the  response  curve  for  a  single  sample  in 
Section  6.10.1  and  extend  to  growth  curves  for  several  samples  in  Section  6.10.2. 

6.10.1  Growth  Curve  for  One  Sample 

The  data  layout  for  a  single  sample  growth  curve  experiment  is  analogous  to 
Table  6.11,  with  the  levels  of  factor  A  representing  time  periods.  Thus  we  have 
a sample of n observation vectors $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n$, for which we compute $\bar{\mathbf{y}}$ and S.
The usual approach is to approximate the shape of the growth curve by a polynomial
function of time. If the time points are equally spaced, we can use orthogonal
polynomials.  This  approach  will  be  described  first,  followed  by  a  method  suitable 
for  unequal  time  intervals. 

Orthogonal polynomials are special contrasts that are often used in testing for linear,
quadratic, cubic, and higher order trends in quantitative factors. For a more complete
description and derivation see Guttman (1982, pp. 194-207), Morrison (1983,
pp. 182-188), or Rencher (2000, pp. 323-331). Here we give only a heuristic
introduction to the use of these contrasts.

Suppose we administer a drug to some subjects and measure a certain reaction at
3-min intervals. Let $\mu_1, \mu_2, \mu_3, \mu_4$, and $\mu_5$ designate the average responses at 0, 3,
6, 9, and 12 min, respectively. To test the hypothesis that there are no trends in the
$\mu_j$'s, we could test $H_0: \mu_1 = \mu_2 = \cdots = \mu_5$ or $H_0: \mathbf{C}\boldsymbol{\mu} = \mathbf{0}$ using the contrast
matrix

\[
\mathbf{C} = \begin{pmatrix}
-2 & -1 &  0 &  1 & 2 \\
 2 & -1 & -2 & -1 & 2 \\
-1 &  2 &  0 & -2 & 1 \\
 1 & -4 &  6 & -4 & 1
\end{pmatrix}
\tag{6.109}
\]

in $T^2 = n(\mathbf{C}\bar{\mathbf{y}})'(\mathbf{C}\mathbf{S}\mathbf{C}')^{-1}(\mathbf{C}\bar{\mathbf{y}})$, as in (6.90). The four rows of C are orthogonal
polynomials that test for linear, quadratic, cubic, and quartic trends in the means. As
noted in Section 6.9.2, any set of orthogonal contrasts in C will give the same value
of $T^2$ to test $H_0: \mu_1 = \mu_2 = \cdots = \mu_5$. However, in this case we will be interested
in using a subset of the rows of C to determine the shape of the response curve.

Table A.13 (Kleinbaum, Kupper, and Muller 1988) gives orthogonal polynomials
for p = 3, 4, ..., 10. The p - 1 entries for each value of p constitute the matrix C.
Some software programs will generate these automatically.

As  with  all  orthogonal  contrasts,  the  rows  of  C  in  (6.109)  sum  to  zero  and  are 
mutually  orthogonal.  It  is  also  apparent  that  the  coefficients  in  each  row  increase 
and  decrease  in  conformity  with  the  desired  pattern.  Thus  the  entries  in  the  first  row, 
(—2,  —1,0,  1,  2),  increase  steadily  in  a  straight-line  trend.  The  values  in  the  second 
row  dip  down  and  back  up  in  a  quadratic-type  bend.  The  third-row  entries  increase, 
decrease,  then  increase  in  a  cubic  pattern  with  two  bends.  The  fourth  row  bends  three 
times  in  a  quartic  curve. 

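To further illustrate how the orthogonal polynomials pinpoint trends in the
means when testing $H_0: \mathbf{C}\boldsymbol{\mu} = \mathbf{0}$, consider the three different patterns for $\boldsymbol{\mu}$
depicted in Figure 6.4, where $\boldsymbol{\mu}_a' = (8, 8, 8, 8, 8)$, $\boldsymbol{\mu}_b' = (20, 16, 12, 8, 4)$, and
$\boldsymbol{\mu}_c' = (5, 12, 15, 12, 5)$. Let us denote the rows of C in (6.109) as $\mathbf{c}_1'$, $\mathbf{c}_2'$, $\mathbf{c}_3'$, and $\mathbf{c}_4'$.
It is clear that $\mathbf{c}_i'\boldsymbol{\mu}_a = 0$ for i = 1, 2, 3, 4; that is, when $H_0: \mu_1 = \cdots = \mu_5$ is true,
all four comparisons confirm it. If $\boldsymbol{\mu}$ has the pattern $\boldsymbol{\mu}_b$, only $\mathbf{c}_1'\boldsymbol{\mu}_b$ is nonzero. The
other rows are not sensitive to a linear pattern. We illustrate this for $\mathbf{c}_1'$ and $\mathbf{c}_2'$:

\[ \mathbf{c}_1'\boldsymbol{\mu}_b = (-2)(20) + (-1)(16) + (0)(12) + (1)(8) + (2)(4) = -40, \]
\[ \mathbf{c}_2'\boldsymbol{\mu}_b = 2(20) - 16 - 2(12) - 8 + 2(4) = 0. \]

Figure 6.4. Three different patterns for $\boldsymbol{\mu}$.

For $\boldsymbol{\mu}_c$, the quadratic contrast $\mathbf{c}_2'\boldsymbol{\mu}_c$ is the dominant nonzero value ($\mathbf{c}_1'\boldsymbol{\mu}_c$ and
$\mathbf{c}_3'\boldsymbol{\mu}_c$ are zero, and $\mathbf{c}_4'\boldsymbol{\mu}_c = 4$ is small because the listed values are not exactly
quadratic). For example,

\[ \mathbf{c}_1'\boldsymbol{\mu}_c = -2(5) - 12 + 12 + 2(5) = 0, \]
\[ \mathbf{c}_2'\boldsymbol{\mu}_c = 2(5) - 12 - 2(15) - 12 + 2(5) = -34. \]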


Thus each orthogonal polynomial independently detects the type of curvature it is
designed for and ignores other types. Of course real curves generally exhibit a mixture
of more than one type of curvature, and in practice more than one orthogonal
polynomial contrast may be significant.
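
The behavior of the contrasts on the three patterns can be verified directly. The short Python sketch below is an illustration added here (the variable names are not from the text); it applies the matrix C of (6.109) to each mean pattern and also checks that the rows of C are mutually orthogonal and sum to zero.

```python
import numpy as np

# Orthogonal polynomial contrasts from (6.109): linear, quadratic, cubic, quartic.
C = np.array([[-2, -1,  0,  1, 2],
              [ 2, -1, -2, -1, 2],
              [-1,  2,  0, -2, 1],
              [ 1, -4,  6, -4, 1]])

patterns = {
    "mu_a": np.array([8, 8, 8, 8, 8]),      # flat
    "mu_b": np.array([20, 16, 12, 8, 4]),   # straight line
    "mu_c": np.array([5, 12, 15, 12, 5]),   # single bend
}

gram = C @ C.T             # off-diagonal zeros mean the rows are orthogonal
row_sums = C.sum(axis=1)   # zeros mean each row is a contrast

for name, mu in patterns.items():
    print(name, C @ mu)
```

For the flat pattern every contrast is zero, for the straight-line pattern only the linear contrast is nonzero, and for the third pattern the quadratic contrast is by far the largest in magnitude.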

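To test hypotheses about the shape of the curve, we therefore use the appropriate
rows of C in (6.109). Suppose we suspected a priori that there would be a combined
linear and quadratic trend. Then we would partition C as follows:

\[
\mathbf{C}_1 = \begin{pmatrix} -2 & -1 & 0 & 1 & 2 \\ 2 & -1 & -2 & -1 & 2 \end{pmatrix},
\qquad
\mathbf{C}_2 = \begin{pmatrix} -1 & 2 & 0 & -2 & 1 \\ 1 & -4 & 6 & -4 & 1 \end{pmatrix}.
\]

We would test $H_0: \mathbf{C}_1\boldsymbol{\mu} = \mathbf{0}$ by

\[ T^2 = n(\mathbf{C}_1\bar{\mathbf{y}})'(\mathbf{C}_1\mathbf{S}\mathbf{C}_1')^{-1}(\mathbf{C}_1\bar{\mathbf{y}}), \]

which is distributed as $T^2_{2,n-1}$, where 2 is the number of rows of $\mathbf{C}_1$, n is the number
of subjects in the sample, and $\bar{\mathbf{y}}$ and S are the mean vector and covariance matrix for
the sample. Similarly, $H_0: \mathbf{C}_2\boldsymbol{\mu} = \mathbf{0}$ is tested by

\[ T^2 = n(\mathbf{C}_2\bar{\mathbf{y}})'(\mathbf{C}_2\mathbf{S}\mathbf{C}_2')^{-1}(\mathbf{C}_2\bar{\mathbf{y}}), \]

which is $T^2_{2,n-1}$. In this case we might expect the first to reject $H_0$ and the second to
accept $H_0$.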

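If we have no a priori expectations as to the shape of the curve, we could proceed
as follows. Test the overall hypothesis $H_0: \mathbf{C}\boldsymbol{\mu} = \mathbf{0}$, and if $H_0$ is rejected, use each
of the rows of C separately to test $H_0: \mathbf{c}_i'\boldsymbol{\mu} = 0$, i = 1, 2, 3, 4. The respective test
statistics are

\[ T^2 = n(\mathbf{C}\bar{\mathbf{y}})'(\mathbf{C}\mathbf{S}\mathbf{C}')^{-1}(\mathbf{C}\bar{\mathbf{y}}), \]

which is $T^2_{4,n-1}$, and

\[ t_i = \frac{\mathbf{c}_i'\bar{\mathbf{y}}}{\sqrt{\mathbf{c}_i'\mathbf{S}\mathbf{c}_i/n}}, \qquad i = 1, 2, 3, 4, \]

each of which is distributed as $t_{n-1}$ (see Example 6.9.2).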

In a case where p is large so that $\boldsymbol{\mu}$ has a large number of levels, say 10 or more,
we would likely want to stop testing after the first four or five rows of C and test
the remaining rows in one group. However, for larger values of p, most tables of
orthogonal polynomials give only the first few rows and omit those corresponding
to higher degrees of curvature. We can find a matrix whose rows are orthogonal
to the rows of a given matrix as follows. Suppose p = 11 so that C is 10 x 11
and $\mathbf{C}_1$ contains the first five orthogonal polynomials. Then a matrix $\mathbf{C}_2$, with rows
orthogonal to those of $\mathbf{C}_1$, can be obtained by selecting five linearly independent
rows of

\[ \mathbf{B} = \mathbf{I} - \mathbf{C}_1'(\mathbf{C}_1\mathbf{C}_1')^{-1}\mathbf{C}_1, \tag{6.110} \]

whose rows can easily be shown to be orthogonal to those of $\mathbf{C}_1$. The matrix B
is not full rank, and some care must be exercised in choosing linearly independent
rows. However, if an incorrect choice of $\mathbf{C}_2$ is made, the computer algorithm should
indicate this as it attempts to invert $\mathbf{C}_2\mathbf{S}\mathbf{C}_2'$ in $T^2 = n(\mathbf{C}_2\bar{\mathbf{y}})'(\mathbf{C}_2\mathbf{S}\mathbf{C}_2')^{-1}(\mathbf{C}_2\bar{\mathbf{y}})$.

Alternatively, to check for significant curvature beyond the rows of $\mathbf{C}_1$ without
finding $\mathbf{C}_2$, we can use the test for additional information in a subset of variables in
Section 5.8. We need not find $\mathbf{C}_2$ in order to find the overall $T^2$, since, as noted in
Section 6.9.2, any full rank (p - 1) x p matrix C such that $\mathbf{C}\mathbf{j} = \mathbf{0}$ will give the
same value in the overall $T^2$-test of $H_0: \mathbf{C}\boldsymbol{\mu} = \mathbf{0}$. We can conveniently use a simple
contrast matrix such as

\[
\mathbf{C} = \begin{pmatrix}
1 & -1 & 0 & \cdots & 0 \\
0 & 1 & -1 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
0 & 0 & 0 & \cdots & -1
\end{pmatrix}
\]

in

\[ T^2 = n(\mathbf{C}\bar{\mathbf{y}})'(\mathbf{C}\mathbf{S}\mathbf{C}')^{-1}(\mathbf{C}\bar{\mathbf{y}}), \tag{6.111} \]

which is $T^2_{p-1,n-1}$. Let $p_1$ be the number of orthogonal polynomials in $\mathbf{C}_1$ and $p_2$ be
the number of rows of $\mathbf{C}_2$ if it were available; that is, $p_1 + p_2 = p - 1$. Then the test
statistic for the $p_1$ orthogonal polynomials in $\mathbf{C}_1$ is

\[ T_1^2 = n(\mathbf{C}_1\bar{\mathbf{y}})'(\mathbf{C}_1\mathbf{S}\mathbf{C}_1')^{-1}(\mathbf{C}_1\bar{\mathbf{y}}), \tag{6.112} \]

which is $T^2_{p_1,n-1}$. We wish to compare $T_1^2$ in (6.112) to $T^2$ in (6.111) to check for
significant curvature beyond the rows of $\mathbf{C}_1$. However, the test for additional information
in a subset of variables in Section 5.8 was for the two-sample case. We can
adapt (5.29) for use with the one-sample case, as follows. The test for significance
of any curvature remaining after that accounted for in $\mathbf{C}_1$ is made by comparing

\[ (n - p_1 - 1)\,\frac{T^2 - T_1^2}{n - 1 + T_1^2} \]

with the critical value $T^2_{\alpha,p_2,n-p_1-1}$.
We  now  describe  an  approach  that  can  be  used  when  the  time  points  are  not 
equally  spaced.  It  may  also  be  of  interest  in  the  equal-time-increment  case  because 
it  provides  an  estimate  of  the  response  function. 

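Suppose we observe the response of the subject at p time points $t_1, t_2, \ldots, t_p$ and
that the average response $\mu$ at any time point t is a polynomial in t of degree k < p:

\[ \mu = \beta_0 + \beta_1 t + \beta_2 t^2 + \cdots + \beta_k t^k. \]

This holds for each point $t_r$ and the corresponding average response $\mu_r$. Thus our
hypothesis becomes

\[
H_0: \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_p \end{pmatrix}
= \begin{pmatrix}
\beta_0 + \beta_1 t_1 + \beta_2 t_1^2 + \cdots + \beta_k t_1^k \\
\beta_0 + \beta_1 t_2 + \beta_2 t_2^2 + \cdots + \beta_k t_2^k \\
\vdots \\
\beta_0 + \beta_1 t_p + \beta_2 t_p^2 + \cdots + \beta_k t_p^k
\end{pmatrix},
\tag{6.113}
\]

which can be expressed in matrix notation as

\[ H_0: \boldsymbol{\mu} = \mathbf{A}\boldsymbol{\beta}, \tag{6.114} \]

where

\[
\mathbf{A} = \begin{pmatrix}
1 & t_1 & t_1^2 & \cdots & t_1^k \\
1 & t_2 & t_2^2 & \cdots & t_2^k \\
\vdots & \vdots & \vdots & & \vdots \\
1 & t_p & t_p^2 & \cdots & t_p^k
\end{pmatrix}
\quad \text{and} \quad
\boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}.
\]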

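In practice, it may be useful to transform the $t_r$'s by subtracting the mean or the
smallest value in order to reduce their size for computational purposes.

The following method of testing $H_0$ is due to Rao (1959, 1973). The model $\boldsymbol{\mu} = \mathbf{A}\boldsymbol{\beta}$
is similar to a regression model $E(\mathbf{y}) = \mathbf{X}\boldsymbol{\beta}$ (see Section 10.2.1). However,
in this case, we have $\mathrm{cov}(\mathbf{y}) = \boldsymbol{\Sigma}$ rather than $\sigma^2\mathbf{I}$, as in the standard regression
assumption. In place of the usual regression approach of seeking $\hat{\boldsymbol{\beta}}$ to minimize
$\mathrm{SSE} = (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})$ [see (10.4) and (10.6)], we use a standardized distance as
in (3.80), $(\bar{\mathbf{y}} - \mathbf{A}\hat{\boldsymbol{\beta}})'\mathbf{S}^{-1}(\bar{\mathbf{y}} - \mathbf{A}\hat{\boldsymbol{\beta}})$. The value of $\hat{\boldsymbol{\beta}}$ that minimizes
$(\bar{\mathbf{y}} - \mathbf{A}\hat{\boldsymbol{\beta}})'\mathbf{S}^{-1}(\bar{\mathbf{y}} - \mathbf{A}\hat{\boldsymbol{\beta}})$ is

\[ \hat{\boldsymbol{\beta}} = (\mathbf{A}'\mathbf{S}^{-1}\mathbf{A})^{-1}\mathbf{A}'\mathbf{S}^{-1}\bar{\mathbf{y}} \tag{6.115} \]

[see Rencher (2000, Section 7.8.1)], and $H_0: \boldsymbol{\mu} = \mathbf{A}\boldsymbol{\beta}$ can be tested by

\[ T^2 = n(\bar{\mathbf{y}} - \mathbf{A}\hat{\boldsymbol{\beta}})'\mathbf{S}^{-1}(\bar{\mathbf{y}} - \mathbf{A}\hat{\boldsymbol{\beta}}), \tag{6.116} \]

which is distributed as $T^2_{p-k-1,n-1}$. The dimension of $T^2$ is reduced from p to
p - k - 1 because k + 1 parameters have been estimated in $\hat{\boldsymbol{\beta}}$. The $T^2$-statistic in (6.116)
is usually given in the equivalent form

\[ T^2 = n(\bar{\mathbf{y}}'\mathbf{S}^{-1}\bar{\mathbf{y}} - \bar{\mathbf{y}}'\mathbf{S}^{-1}\mathbf{A}\hat{\boldsymbol{\beta}}). \tag{6.117} \]

The mean response at the rth time point,

\[ \mu_r = \beta_0 + \beta_1 t_r + \beta_2 t_r^2 + \cdots + \beta_k t_r^k = (1, t_r, t_r^2, \ldots, t_r^k)\boldsymbol{\beta} = \mathbf{a}_r'\boldsymbol{\beta}, \]

can be estimated by

\[ \hat{\mu}_r = \mathbf{a}_r'\hat{\boldsymbol{\beta}}. \tag{6.118} \]

Simultaneous confidence intervals for all possible $\mathbf{a}'\boldsymbol{\beta}$ are given by

\[ \mathbf{a}'\hat{\boldsymbol{\beta}} \pm \frac{T_\alpha}{\sqrt{n}} \sqrt{\mathbf{a}'(\mathbf{A}'\mathbf{S}^{-1}\mathbf{A})^{-1}\mathbf{a}\left(1 + \frac{T^2}{n-1}\right)}, \tag{6.119} \]

where $T_\alpha = \sqrt{T^2_{\alpha,k+1,n-1}}$ is from Table A.7 and $T^2$ is given by (6.116) or (6.117).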
The intervals in (6.119) for $\mathbf{a}'\boldsymbol{\beta}$ include, of course, $\mathbf{a}_r'\boldsymbol{\beta}$ for the p rows of A, that
is, confidence intervals for the p time points. If $\mathbf{a}_r'\boldsymbol{\beta}$, r = 1, 2, ..., p, are the only
values of interest, we can shorten the intervals in (6.119) by using a Bonferroni
coefficient $t_{\alpha/2p}$ in place of $T_\alpha$:

\[ \mathbf{a}_r'\hat{\boldsymbol{\beta}} \pm \frac{t_{\alpha/2p}}{\sqrt{n}} \sqrt{\mathbf{a}_r'(\mathbf{A}'\mathbf{S}^{-1}\mathbf{A})^{-1}\mathbf{a}_r\left(1 + \frac{T^2}{n-1}\right)}, \tag{6.120} \]

where $t_{\alpha/2p} = t_{\alpha/2p,n-1}$. Bonferroni critical values $t_{\alpha/2p,\nu}$ are given in Table A.8.
See procedures 2 and 3 in Section 5.5 for additional comments on the use of $t_{\alpha/2p}$
and $T_\alpha$.


Example  6.10.1.  Potthoff  and  Roy  (1964)  reported  measurements  in  a  dental  study 
on  boys  and  girls  from  ages  8  to  14.  The  data  are  given  in  Table  6.16. 

To  illustrate  the  methods  of  this  section,  we  use  the  data  for  the  boys  alone.  In 
Example  6.10.2  we  will  compare  the  growth  curves  of  the  boys  with  those  of  the 
girls. We first test the overall hypothesis $H_0: \mathbf{C}\boldsymbol{\mu} = \mathbf{0}$, where C contains orthogonal
polynomials for linear, quadratic, and cubic effects:

\[
\mathbf{C} = \begin{pmatrix}
-3 & -1 &  1 & 3 \\
 1 & -1 & -1 & 1 \\
-1 &  3 & -3 & 1
\end{pmatrix}.
\tag{6.121}
\]

Table 6.16. Dental Measurements

            Girls' Ages in Years                    Boys' Ages in Years
Subject      8     10     12     14     Subject      8     10     12     14

   1       21.0   20.0   21.5   23.0       1       26.0   25.0   29.0   31.0
   2       21.0   21.5   24.0   25.5       2       21.5   22.5   23.0   26.5
   3       20.5   24.0   24.5   26.0       3       23.0   22.5   24.0   27.5
   4       23.5   24.5   25.0   26.5       4       25.5   27.5   26.5   27.0
   5       21.5   23.0   22.5   23.5       5       20.0   23.5   22.5   26.0
   6       20.0   21.0   21.0   22.5       6       24.5   25.5   27.0   28.5
   7       21.5   22.5   23.0   25.0       7       22.0   22.0   24.5   26.5
   8       23.0   23.0   23.5   24.0       8       24.0   21.5   24.5   25.5
   9       20.0   21.0   22.0   21.5       9       23.0   20.5   31.0   26.0
  10       16.5   19.0   19.0   19.5      10       27.5   28.0   31.0   31.5
  11       24.5   25.0   28.0   28.0      11       23.0   23.0   23.5   25.0
                                          12       21.5   23.5   24.0   28.0
                                          13       17.0   24.5   26.0   29.5
                                          14       22.5   25.5   25.5   26.0
                                          15       23.0   24.5   26.0   30.0
                                          16       22.0   21.5   23.5   25.0

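From the 16 observation vectors we obtain

\[
\bar{\mathbf{y}} = \begin{pmatrix} 22.88 \\ 23.81 \\ 25.72 \\ 27.47 \end{pmatrix},
\qquad
\mathbf{S} = \begin{pmatrix}
6.02 & 2.29 & 3.63 & 1.61 \\
2.29 & 4.56 & 2.19 & 2.81 \\
3.63 & 2.19 & 7.03 & 3.24 \\
1.61 & 2.81 & 3.24 & 4.35
\end{pmatrix}.
\]

To test $H_0: \mathbf{C}\boldsymbol{\mu} = \mathbf{0}$, we calculate

\[ T^2 = n(\mathbf{C}\bar{\mathbf{y}})'(\mathbf{C}\mathbf{S}\mathbf{C}')^{-1}(\mathbf{C}\bar{\mathbf{y}}) = 77.957, \]

which exceeds $T^2_{.01,3,15} = 19.867$. We now test $H_0: \mathbf{c}_i'\boldsymbol{\mu} = 0$ for each row of C to
determine the shape of the growth curve. For the linear effect, using the first row, $\mathbf{c}_1'$,
we obtain

\[ t_1 = \frac{\mathbf{c}_1'\bar{\mathbf{y}}}{\sqrt{\mathbf{c}_1'\mathbf{S}\mathbf{c}_1/n}} = 7.722 > t_{.005,15} = 2.947. \]

The test of significance of the quadratic component using the second row yields

\[ t_2 = \frac{\mathbf{c}_2'\bar{\mathbf{y}}}{\sqrt{\mathbf{c}_2'\mathbf{S}\mathbf{c}_2/n}} = 1.370 < t_{.025,15} = 2.131. \]

To test for a cubic trend, we use the third row of C:

\[ t_3 = \frac{\mathbf{c}_3'\bar{\mathbf{y}}}{\sqrt{\mathbf{c}_3'\mathbf{S}\mathbf{c}_3/n}} = -.511 > -t_{.025,15} = -2.131. \]

Thus only the linear trend is needed to describe the growth curve.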
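To model the curve for each variable, we use (6.113),

\[ \mu_r = \beta_0 + \beta_1 t_r, \quad r = 1, 2, 3, 4, \qquad \text{or} \qquad \boldsymbol{\mu} = \mathbf{A}\boldsymbol{\beta}, \]

where

\[
\mathbf{A} = \begin{pmatrix} 1 & -3 \\ 1 & -1 \\ 1 & 1 \\ 1 & 3 \end{pmatrix}.
\]

The values in the second column of A are obtained as t = age - 11. By (6.115), we
obtain

\[
\hat{\boldsymbol{\beta}} = (\mathbf{A}'\mathbf{S}^{-1}\mathbf{A})^{-1}\mathbf{A}'\mathbf{S}^{-1}\bar{\mathbf{y}} = \begin{pmatrix} 25.002 \\ .834 \end{pmatrix},
\]

and our prediction equation is

\[ \hat{\mu} = 25.002 + .834t = 25.002 + .834(\text{age} - 11) = 15.828 + .834(\text{age}). \]  □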
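
The computations in Example 6.10.1 can be reproduced approximately from the rounded $\bar{\mathbf{y}}$ and S given above. The Python sketch below is an illustration only; the variable names are chosen here, and because the inputs are rounded to two decimals the results agree with the reported values only to about two significant decimals.

```python
import numpy as np

n = 16  # number of boys
ybar = np.array([22.88, 23.81, 25.72, 27.47])
S = np.array([[6.02, 2.29, 3.63, 1.61],
              [2.29, 4.56, 2.19, 2.81],
              [3.63, 2.19, 7.03, 3.24],
              [1.61, 2.81, 3.24, 4.35]])

# Orthogonal polynomial contrasts from (6.121): linear, quadratic, cubic.
C = np.array([[-3, -1,  1, 3],
              [ 1, -1, -1, 1],
              [-1,  3, -3, 1]], dtype=float)

# Overall test: T^2 = n (C ybar)' (C S C')^{-1} (C ybar).
Cy = C @ ybar
T2 = n * Cy @ np.linalg.solve(C @ S @ C.T, Cy)

# One t-statistic per row: t_i = c_i' ybar / sqrt(c_i' S c_i / n).
t = Cy / np.sqrt(np.diag(C @ S @ C.T) / n)

# Linear growth curve, t = age - 11, estimated by (6.115).
A = np.column_stack([np.ones(4), [-3.0, -1.0, 1.0, 3.0]])
Sinv_A = np.linalg.solve(S, A)                              # S^{-1} A
beta = np.linalg.solve(A.T @ Sinv_A, Sinv_A.T @ ybar)       # (A'S^{-1}A)^{-1} A'S^{-1} ybar

print(round(T2, 2), np.round(t, 3), np.round(beta, 3))
```

Using `np.linalg.solve` rather than forming $\mathbf{S}^{-1}$ explicitly is a numerical convenience and does not change the estimator.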


6.10.2  Growth  Curves  for  Several  Samples 

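For the case of several samples or groups, the data layout would be similar to that
in Table 6.9, where the p levels of factor A represent time points. Assuming the
time points are equally spaced, we can use orthogonal polynomials in the (p - 1) x
p contrast matrix C and express the basic hypothesis in the form $H_0: \mathbf{C}\bar{\boldsymbol{\mu}}_. = \mathbf{0}$,
where $\bar{\boldsymbol{\mu}}_. = \sum_{i=1}^{k} \boldsymbol{\mu}_i/k$. This is equivalent to $H_0: \bar{\mu}_{.1} = \bar{\mu}_{.2} = \cdots = \bar{\mu}_{.p}$, which
compares the means of the p time points averaged across groups. As in Section 6.9.3,
let us denote the sample mean vectors for the k groups as $\bar{\mathbf{y}}_{1.}, \bar{\mathbf{y}}_{2.}, \ldots, \bar{\mathbf{y}}_{k.}$, with
grand mean $\bar{\mathbf{y}}_{..}$ and pooled covariance matrix $\mathbf{S}_{\mathrm{pl}} = \mathbf{E}/\nu_E$. For the overall test of
$H_0: \mathbf{C}\bar{\boldsymbol{\mu}}_. = \mathbf{0}$ we use the test statistic

\[ T^2 = N(\mathbf{C}\bar{\mathbf{y}}_{..})'(\mathbf{C}\mathbf{S}_{\mathrm{pl}}\mathbf{C}')^{-1}(\mathbf{C}\bar{\mathbf{y}}_{..}), \tag{6.122} \]

which is $T^2_{p-1,\nu_E}$ as in (6.93), where $N = \sum_i n_i$ for unbalanced data or N = kn
for balanced data. The corresponding degrees of freedom for error is given by $\nu_E = N - k$
or $\nu_E = k(n - 1)$. The hypothesis that the average growth curve (averaged over groups)
has a particular form can be tested with $\mathbf{C}_1$, containing a subset of the rows of C:

\[ T^2 = N(\mathbf{C}_1\bar{\mathbf{y}}_{..})'(\mathbf{C}_1\mathbf{S}_{\mathrm{pl}}\mathbf{C}_1')^{-1}(\mathbf{C}_1\bar{\mathbf{y}}_{..}), \tag{6.123} \]

which is distributed as $T^2_{p_1,\nu_E}$, where $p_1$ is the number of rows in $\mathbf{C}_1$.

The growth curves for the k groups can be compared by the interaction or parallelism
test of Section 6.9.3 using either C or $\mathbf{C}_1$. We do a one-way MANOVA on
$\mathbf{C}\mathbf{y}_{ij}$ or $\mathbf{C}_1\mathbf{y}_{ij}$, or equivalently calculate by (6.96),

\[
\Lambda = \frac{|\mathbf{C}\mathbf{E}\mathbf{C}'|}{|\mathbf{C}(\mathbf{E}+\mathbf{H})\mathbf{C}'|}
\qquad \text{or} \qquad
\Lambda_1 = \frac{|\mathbf{C}_1\mathbf{E}\mathbf{C}_1'|}{|\mathbf{C}_1(\mathbf{E}+\mathbf{H})\mathbf{C}_1'|},
\tag{6.124}
\]

which are distributed as $\Lambda_{p-1,k-1,\nu_E}$ and $\Lambda_{p_1,k-1,\nu_E}$, respectively.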


Example  6.10.2.  In  Example  6.10.1,  we  found  a  linear  trend  for  the  growth  curve 
for  dental  measurements  of  boys  in  Table  6.16.  We  now  consider  the  growth  curve 
for  the  combined  group  and  also  compare  the  girls’  group  with  the  boys’  group. 

The two sample sizes are unequal and we use (6.33) to calculate the E matrix for
the two groups,

\[
\mathbf{E} = \begin{pmatrix}
135.39 &  67.88 &  97.76 &  67.76 \\
 67.88 & 103.76 &  72.86 &  82.71 \\
 97.76 &  72.86 & 161.39 & 103.27 \\
 67.76 &  82.71 & 103.27 & 124.64
\end{pmatrix},
\]

from which we obtain $\mathbf{S}_{\mathrm{pl}} = \mathbf{E}/\nu_E$. Using the C matrix in (6.121), we can test the
basic hypothesis of equal means for the combined samples, $H_0: \mathbf{C}\bar{\boldsymbol{\mu}}_. = \mathbf{0}$, using
(6.122):

\[
T^2 = N(\mathbf{C}\bar{\mathbf{y}}_{..})'(\mathbf{C}\mathbf{S}_{\mathrm{pl}}\mathbf{C}')^{-1}(\mathbf{C}\bar{\mathbf{y}}_{..})
    = 118.322 > T^2_{.01,3,25} = 15.538.
\]

To test for a linear trend, we use the first row of C in (6.123):

\[
T^2 = N(\mathbf{c}_1'\bar{\mathbf{y}}_{..})'(\mathbf{c}_1'\mathbf{S}_{\mathrm{pl}}\mathbf{c}_1)^{-1}(\mathbf{c}_1'\bar{\mathbf{y}}_{..})
    = 99.445 > T^2_{.01,1,25} = 7.770.
\]

This is, of course, the square of a t-statistic, but in the $T^2$ form it can readily be
compared with the preceding $T^2$ using all three rows of C. The linear trend is seen
to dominate the relationship among the means.

We now compare the growth curves of the two groups using (6.124). For C, we
obtain

\[
\Lambda = \frac{|\mathbf{C}\mathbf{E}\mathbf{C}'|}{|\mathbf{C}(\mathbf{E}+\mathbf{H})\mathbf{C}'|}
        = \frac{1.3996 \times 10^{8}}{1.9025 \times 10^{8}}
        = .736 > \Lambda_{.05,3,1,25} = .717.
\]

For the linear trend, we have

\[
\Lambda = \frac{|\mathbf{c}_1'\mathbf{E}\mathbf{c}_1|}{|\mathbf{c}_1'(\mathbf{E}+\mathbf{H})\mathbf{c}_1|}
        = \frac{1184.2}{1427.9}
        = .829 < \Lambda_{.05,1,1,25} = .855.
\]

Thus the overall comparison does not reach significance, but the more specific
comparison of linear trends does give a significant result. □
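
The quantities in Example 6.10.2 can be recomputed from the data in Table 6.16. The Python sketch below is an illustration under stated assumptions: the grand mean is taken as the mean of all 27 observation vectors, E and H are computed as in a one-way MANOVA with two groups, and the helper names are chosen here. Small differences from the printed values in the last reported digit are possible because of rounding.

```python
import numpy as np

girls = np.array([
    [21.0, 20.0, 21.5, 23.0], [21.0, 21.5, 24.0, 25.5], [20.5, 24.0, 24.5, 26.0],
    [23.5, 24.5, 25.0, 26.5], [21.5, 23.0, 22.5, 23.5], [20.0, 21.0, 21.0, 22.5],
    [21.5, 22.5, 23.0, 25.0], [23.0, 23.0, 23.5, 24.0], [20.0, 21.0, 22.0, 21.5],
    [16.5, 19.0, 19.0, 19.5], [24.5, 25.0, 28.0, 28.0]])
boys = np.array([
    [26.0, 25.0, 29.0, 31.0], [21.5, 22.5, 23.0, 26.5], [23.0, 22.5, 24.0, 27.5],
    [25.5, 27.5, 26.5, 27.0], [20.0, 23.5, 22.5, 26.0], [24.5, 25.5, 27.0, 28.5],
    [22.0, 22.0, 24.5, 26.5], [24.0, 21.5, 24.5, 25.5], [23.0, 20.5, 31.0, 26.0],
    [27.5, 28.0, 31.0, 31.5], [23.0, 23.0, 23.5, 25.0], [21.5, 23.5, 24.0, 28.0],
    [17.0, 24.5, 26.0, 29.5], [22.5, 25.5, 25.5, 26.0], [23.0, 24.5, 26.0, 30.0],
    [22.0, 21.5, 23.5, 25.0]])

C = np.array([[-3, -1,  1, 3],
              [ 1, -1, -1, 1],
              [-1,  3, -3, 1]], dtype=float)  # contrasts from (6.121)

groups = [girls, boys]
N = sum(len(g) for g in groups)
nu_E = N - len(groups)
means = [g.mean(axis=0) for g in groups]
grand = np.vstack(groups).mean(axis=0)

# Within-groups (E) and between-groups (H) matrices of the one-way MANOVA.
E = sum((g - m).T @ (g - m) for g, m in zip(groups, means))
H = sum(len(g) * np.outer(m - grand, m - grand) for g, m in zip(groups, means))
S_pl = E / nu_E

def t2(M):
    """T^2 of (6.122)/(6.123) for contrast matrix M."""
    v = M @ grand
    return N * v @ np.linalg.solve(M @ S_pl @ M.T, v)

def wilks(M):
    """Wilks' Lambda of (6.124) for contrast matrix M."""
    return np.linalg.det(M @ E @ M.T) / np.linalg.det(M @ (E + H) @ M.T)

c1 = C[:1]  # linear contrast as a 1 x 4 matrix
T2_all, T2_lin = t2(C), t2(c1)
lam_all, lam_lin = wilks(C), wilks(c1)

print(np.round(E, 2))
print(round(T2_all, 2), round(T2_lin, 2), round(lam_all, 3), round(lam_lin, 3))
```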

6.10.3  Additional  Topics 

Jackson and Bryce (1981) presented methods of analyzing growth curves based on
univariate linear models. Snee (1972) and Snee, Acuff, and Gibson (1979) proposed
the use of eigenvalues and eigenvectors of a matrix derived from residuals after fitting
the model. If one of the eigenvalues is dominant, certain simplifications result. Bryce
(1980) discussed a similar simplification for the two-group case. Geisser (1980) and
Fearn (1975, 1977) gave the Bayesian approach to growth curves, including estimation
and prediction. Zerbe (1979a, b) provided a randomization test requiring fewer
assumptions than normal-based tests.

6.11.1 Test for Additional Information

In  Section  5.8,  we  considered  tests  of  significance  of  the  additional  information  in 
a  subvector  when  comparing  two  groups.  We  now  extend  these  concepts  to  several 
groups  and  use  similar  notation. 

Let y be a p x 1 vector of measurements and x be a q x 1 vector measured
in addition to y. We are interested in determining whether x makes a significant
contribution to the test of $H_0: \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2 = \cdots = \boldsymbol{\mu}_k$ above and beyond y. Another
way to phrase the question is, Can the separation of groups achieved by x be predicted
from the separation achieved by y? It is not necessary, of course, that x represent
new variables. It may be that $\binom{\mathbf{y}}{\mathbf{x}}$ is a partitioning of the present variables, and we
wish to know if the variables in x can be deleted because they do not contribute to
rejecting $H_0$.

We  consider  here  only  the  one-way  MANOVA,  but  the  results  could  be  extended 
to  higher  order  designs,  where  various  possibilities  arise.  In  a  two-way  context,  for 
example,  it  may  happen  that  x  contributes  nothing  to  the  A  main  effect  but  does 
contribute  significantly  to  the  B  main  effect. 

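It is assumed that we have k samples,

\[ \begin{pmatrix} \mathbf{y}_{ij} \\ \mathbf{x}_{ij} \end{pmatrix}, \qquad i = 1, 2, \ldots, k; \quad j = 1, 2, \ldots, n, \]

from which we calculate

\[
\mathbf{E} = \begin{pmatrix} \mathbf{E}_{yy} & \mathbf{E}_{yx} \\ \mathbf{E}_{xy} & \mathbf{E}_{xx} \end{pmatrix},
\qquad
\mathbf{H} = \begin{pmatrix} \mathbf{H}_{yy} & \mathbf{H}_{yx} \\ \mathbf{H}_{xy} & \mathbf{H}_{xx} \end{pmatrix},
\]

where E and H are (p + q) x (p + q) and $\mathbf{E}_{yy}$ and $\mathbf{H}_{yy}$ are p x p.
Then

\[ \Lambda(\mathbf{y}, \mathbf{x}) = \frac{|\mathbf{E}|}{|\mathbf{E} + \mathbf{H}|} \tag{6.125} \]

is distributed as $\Lambda_{p+q,\nu_H,\nu_E}$ and tests the significance of group separation using the
full vector $\binom{\mathbf{y}}{\mathbf{x}}$. In the balanced one-way model, the degrees of freedom are $\nu_H = k - 1$
and $\nu_E = k(n - 1)$. To test group separation using the reduced vector y, we can
compute

\[ \Lambda(\mathbf{y}) = \frac{|\mathbf{E}_{yy}|}{|\mathbf{E}_{yy} + \mathbf{H}_{yy}|}, \tag{6.126} \]

which is distributed as $\Lambda_{p,\nu_H,\nu_E}$.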

To  test  the  hypothesis  that  the  extra  variables  in  x  do  not  contribute  anything 
significant  to  separating  the  groups  beyond  the  information  already  available  in  y, 
we calculate

\[ \Lambda(\mathbf{x}|\mathbf{y}) = \frac{\Lambda(\mathbf{y}, \mathbf{x})}{\Lambda(\mathbf{y})}, \tag{6.127} \]

which is distributed as $\Lambda_{q,\nu_H,\nu_E-p}$. Note that the dimension of $\Lambda(\mathbf{x}|\mathbf{y})$ is q, the number
of x's. The error degrees of freedom, $\nu_E - p$, has been adjusted for the p y's.
Thus to test for the contribution of additional variables to separation of groups, we
take the ratio of Wilks' $\Lambda$ for the full set of variables in (6.125) to Wilks' $\Lambda$ for the
reduced set in (6.126). If the addition of x makes $\Lambda(\mathbf{y}, \mathbf{x})$ sufficiently smaller than
$\Lambda(\mathbf{y})$, then $\Lambda(\mathbf{x}|\mathbf{y})$ in (6.127) will be small enough to reject the hypothesis.

If we are interested in the effect of adding a single x, then q = 1, and (6.127)
becomes

\[ \Lambda(x|y_1, \ldots, y_p) = \frac{\Lambda(y_1, \ldots, y_p, x)}{\Lambda(y_1, \ldots, y_p)}, \tag{6.128} \]

which is distributed as $\Lambda_{1,\nu_H,\nu_E-p}$. In this test we are inquiring whether x reduces the
overall $\Lambda$ by a significant amount. With a dimension of 1, the $\Lambda$-statistic in (6.128)
has an exact F-transformation from Table 6.1,

\[ F = \frac{1 - \Lambda}{\Lambda} \cdot \frac{\nu_E - p}{\nu_H}, \tag{6.129} \]

which is distributed as $F_{\nu_H,\nu_E-p}$. The statistic (6.128) is often referred to as a partial
$\Lambda$-statistic; correspondingly, (6.129) is called a partial F-statistic.

In  (6.128)  and  (6.129),  we  have  a  test  of  the  significance  of  a  variable  in  the 
presence  of  the  other  variables.  For  a  breakdown  of  precisely  how  the  contribution 
of a variable depends on the other variables, see Rencher (1993; 1998, Section 4.1.6).

We can rewrite (6.128) as

\[ \Lambda(y_1, \ldots, y_p, x) = \Lambda(x|y_1, \ldots, y_p)\,\Lambda(y_1, \ldots, y_p) \le \Lambda(y_1, \ldots, y_p), \tag{6.130} \]

which shows that Wilks' $\Lambda$ can only decrease with an additional variable.


Example 6.11.1. We use the rootstock data of Table 6.2 to illustrate tests on subvectors.
From Example 6.1.7, we have, for all four variables, $\Lambda(y_1, y_2, y_3, y_4) = .1540$.
For the first two variables, we obtain $\Lambda(y_1, y_2) = .6990$. Then to test the significance
of $y_3$ and $y_4$ adjusted for $y_1$ and $y_2$, we have by (6.127),

\[
\Lambda(y_3, y_4|y_1, y_2) = \frac{\Lambda(y_1, y_2, y_3, y_4)}{\Lambda(y_1, y_2)} = \frac{.1540}{.6990} = .2203,
\]

which is less than the critical value $\Lambda_{.05,2,5,40} = .639$.

Similarly, the test for $y_4$ adjusted for $y_1$, $y_2$, and $y_3$ is given by (6.128) as

\[
\Lambda(y_4|y_1, y_2, y_3) = \frac{\Lambda(y_1, y_2, y_3, y_4)}{\Lambda(y_1, y_2, y_3)} = \frac{.1540}{.2460}
= .6261 < \Lambda_{.05,1,5,39} = .759.
\]

For each of the other variables, we have a similar test:

\[ y_3: \quad \Lambda(y_3|y_1, y_2, y_4) = \frac{.1540}{\Lambda(y_1, y_2, y_4)} = .5618 < \Lambda_{.05,1,5,39} = .759, \]
\[ y_2: \quad \Lambda(y_2|y_1, y_3, y_4) = \frac{.1540}{\Lambda(y_1, y_3, y_4)} = .8014 > \Lambda_{.05,1,5,39} = .759, \]
\[ y_1: \quad \Lambda(y_1|y_2, y_3, y_4) = \frac{.1540}{\Lambda(y_2, y_3, y_4)} = .9630 > \Lambda_{.05,1,5,39} = .759. \]

Thus the two variables $y_3$ and $y_4$, either individually or together, contribute a
significant amount to separation of the six groups. □
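
The partial $\Lambda$ in (6.128) and its exact F-transformation (6.129) involve only simple arithmetic. The following Python sketch, added for illustration, evaluates both for $y_4$ adjusted for $y_1$, $y_2$, and $y_3$. The degrees of freedom $\nu_H = 5$ and $\nu_E = 42$ are not stated explicitly in this example; they are inferred here from the subscripts of the critical value $\Lambda_{.05,1,5,39}$ with p = 3 and should be treated as an assumption.

```python
lam_full = 0.1540     # Wilks' Lambda for (y1, y2, y3, y4)
lam_reduced = 0.2460  # Wilks' Lambda for (y1, y2, y3)

nu_H = 5   # hypothesis degrees of freedom (six groups), inferred
nu_E = 42  # error degrees of freedom, inferred from nu_E - p = 39
p = 3      # number of variables already in the reduced set

# Partial Lambda (6.128) and partial F (6.129).
partial_lam = lam_full / lam_reduced
partial_F = (1 - partial_lam) / partial_lam * (nu_E - p) / nu_H

print(round(partial_lam, 4), round(partial_F, 2))
```

The resulting F would be compared with a critical value of the F distribution with $\nu_H$ and $\nu_E - p$ degrees of freedom, which leads to the same decision as comparing the partial $\Lambda$ with $\Lambda_{.05,1,5,39}$.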


6.11.2  Stepwise  Selection  of  Variables 

If  there  are  no  variables  for  which  we  have  a  priori  interest  in  testing  for  significance, 
we  can  do  a  data-directed  search  for  the  variables  that  best  separate  the  groups.  Such 
a  strategy  is  often  called  stepwise  discriminant  analysis,  although  it  could  more  aptly 
be  called  stepwise  MANOVA.  The  procedure  appears  in  many  software  packages. 

We first describe an approach that is usually called forward selection. At the first
step calculate $\Lambda(y_i)$ for each individual variable and choose the one with minimum
$\Lambda(y_i)$ (or maximum associated F). At the second step calculate $\Lambda(y_i \mid y_1)$ for each
of the $p - 1$ variables not entered at the first step, where $y_1$ indicates the first variable
entered. For the second variable we choose the one with minimum $\Lambda(y_i \mid y_1)$ (or
maximum associated partial F), that is, the variable that adds the maximum separation
to the one entered at step 1. Denote the variable entered at step 2 by $y_2$. At
the third step calculate $\Lambda(y_i \mid y_1, y_2)$ for each of the $p - 2$ remaining variables and
choose the one that minimizes $\Lambda(y_i \mid y_1, y_2)$ (or maximizes the associated partial F).
Continue this process until the largest partial F falls below some predetermined threshold
value, say, $F_{in}$.

A stepwise procedure follows a similar sequence, except that after a variable has
entered, the variables previously selected are reexamined to see if each still contributes
a significant amount. The variable with smallest partial F will be removed
if the partial F is less than a second threshold value, $F_{out}$. If $F_{out}$ is the same as
$F_{in}$, there is a very small possibility that the procedure will cycle continuously without
stopping. This possibility can be eliminated by using a value of $F_{out}$ slightly less than
$F_{in}$. For an illustration of the stepwise procedure, see Example 8.9.
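
The forward selection steps described above can be sketched in Python with NumPy for a one-way layout. This is a minimal illustration under stated assumptions, not the algorithm of any particular software package: the function names `wilks_lambda` and `forward_select`, the default threshold `f_in=4.0`, and the conversion of a one-variable partial Λ to a partial F as $[(1 - \Lambda)/\Lambda](\nu_E - q)/\nu_H$, with $q$ the number of variables already entered, are all choices made for this sketch.

```python
import numpy as np

def wilks_lambda(Y, groups, cols):
    """Wilks' Lambda = |E| / |E + H| for a one-way layout, using only `cols`."""
    X = Y[:, cols]
    grand_mean = X.mean(axis=0)
    q = X.shape[1]
    E = np.zeros((q, q))   # within-groups (error) matrix
    H = np.zeros((q, q))   # between-groups (hypothesis) matrix
    for g in np.unique(groups):
        Xg = X[groups == g]
        dev = Xg - Xg.mean(axis=0)
        E += dev.T @ dev
        diff = (Xg.mean(axis=0) - grand_mean).reshape(-1, 1)
        H += Xg.shape[0] * (diff @ diff.T)
    return np.linalg.det(E) / np.linalg.det(E + H)

def forward_select(Y, groups, f_in=4.0):
    """Enter variables one at a time by largest partial F until F < f_in."""
    Y = np.asarray(Y, dtype=float)
    groups = np.asarray(groups)
    n, p = Y.shape
    k = len(np.unique(groups))
    nu_h, nu_e = k - 1, n - k
    selected = []          # column indices in order of entry
    lam_selected = 1.0     # Lambda for the variables entered so far
    while len(selected) < p:
        candidates = []
        for j in range(p):
            if j in selected:
                continue
            # partial Lambda of y_j adjusted for the variables already entered
            lam_partial = wilks_lambda(Y, groups, selected + [j]) / lam_selected
            # assumed conversion of a one-variable partial Lambda to partial F
            partial_f = (1 - lam_partial) / lam_partial * (nu_e - len(selected)) / nu_h
            candidates.append((partial_f, j, lam_partial))
        partial_f, j, lam_partial = max(candidates)
        if partial_f < f_in:
            break
        selected.append(j)
        lam_selected *= lam_partial
    return selected
```

A full stepwise version would add, after each entry, a pass that recomputes the partial F of every variable already selected and removes the weakest one if its partial F is below $F_{out}$.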
PROBLEMS


6.1 Verify the computational forms given in (6.3) and (6.5); that is, show that

(a) $\sum_{ij} (y_{ij} - \bar{y}_{i.})^2 = \sum_{ij} y_{ij}^2 - \sum_i y_{i.}^2/n$,

(b) $n \sum_i (\bar{y}_{i.} - \bar{y}_{..})^2 = \sum_i y_{i.}^2/n - y_{..}^2/kn$.

6.2 Show that Wilks’ $\Lambda$ can be expressed in terms of the eigenvalues of $\mathbf{E}^{-1}\mathbf{H}$ as
in (6.14).
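
A numerical check (not a proof) of the identity in Problem 6.2 can be made in Python with NumPy. The sketch below assumes the usual definition $\Lambda = |\mathbf{E}|/|\mathbf{E} + \mathbf{H}|$ and the eigenvalue form $\prod_i 1/(1 + \lambda_i)$; the matrices are random positive definite stand-ins generated for illustration, not data from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
p, nu_e, nu_h = 3, 20, 4

# Random stand-ins for the error and hypothesis matrices (illustrative only)
A = rng.normal(size=(nu_e, p))
E = A.T @ A
B = rng.normal(size=(nu_h, p))
H = B.T @ B

# Determinant form of Wilks' Lambda
lam_det = np.linalg.det(E) / np.linalg.det(E + H)

# Eigenvalue form: product of 1 / (1 + lambda_i) over eigenvalues of E^{-1} H
eigvals = np.linalg.eigvals(np.linalg.solve(E, H)).real
lam_eig = np.prod(1.0 / (1.0 + eigvals))

print(lam_det, lam_eig)
```

The two printed values should agree to within floating-point rounding.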

6.3 Show that the eigenvalues of $\mathbf{E}^{-1}\mathbf{H}$ are the same as those of
$(\mathbf{E}^{1/2})^{-1}\mathbf{H}(\mathbf{E}^{1/2})^{-1}$, as noted in Section 6.1.4, where $\mathbf{E}^{1/2}$ is the
square root matrix defined in (2.112).

6.4 Show that $F_3$ in (6.27) is the same as $F_1$ in (6.25).

6.5 Show that if $p < \nu_H$, then $F_3$ in (6.31) is the same as $F_2$ in (6.30).

6.6 Show that if there is only one nonzero eigenvalue $\lambda_1$, then $U^{(1)}$, $V^{(1)}$, and $\Lambda$
can be expressed in terms of $\theta$, as in (6.34)–(6.36).

6.7 Show that (5.16), (5.18), and (5.19), which relate $T^2$ to $\Lambda$, $V^{(s)}$, and $\theta$, follow
from (6.34)–(6.36) and (6.39), $U^{(1)} = T^2/(n_1 + n_2 - 2)$.


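6.8 Verify the computational forms of $\mathbf{H}$ and $\mathbf{E}$ in (6.32) and (6.33); that is, show
that

(a) $\sum_{i=1}^{k} n_i(\bar{\mathbf{y}}_{i.} - \bar{\mathbf{y}}_{..})(\bar{\mathbf{y}}_{i.} - \bar{\mathbf{y}}_{..})' = \sum_{i=1}^{k} \mathbf{y}_{i.}\mathbf{y}_{i.}'/n_i - \mathbf{y}_{..}\mathbf{y}_{..}'/N$,

(b) $\sum_{i=1}^{k} \sum_{j=1}^{n_i} (\mathbf{y}_{ij} - \bar{\mathbf{y}}_{i.})(\mathbf{y}_{ij} - \bar{\mathbf{y}}_{i.})' = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \mathbf{y}_{ij}\mathbf{y}_{ij}' - \sum_{i=1}^{k} \mathbf{y}_{i.}\mathbf{y}_{i.}'/n_i$.

6.9 Show that for two groups, $\mathbf{H} = \sum_{i=1}^{2} n_i(\bar{\mathbf{y}}_{i.} - \bar{\mathbf{y}}_{..})(\bar{\mathbf{y}}_{i.} - \bar{\mathbf{y}}_{..})'$ can be expressed
as $\mathbf{H} = [n_1 n_2/(n_1 + n_2)](\bar{\mathbf{y}}_{1.} - \bar{\mathbf{y}}_{2.})(\bar{\mathbf{y}}_{1.} - \bar{\mathbf{y}}_{2.})'$, thus verifying (6.38). Note that

$$\bar{\mathbf{y}}_{..} = \frac{n_1\bar{\mathbf{y}}_{1.} + n_2\bar{\mathbf{y}}_{2.}}{n_1 + n_2}.$$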

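6.10 Show that $\theta$ can be expressed as $\theta = \mathrm{SSH}(z)/[\mathrm{SSE}(z) + \mathrm{SSH}(z)]$ as in (6.42).

6.11 Show that

$$\prod_{i=1}^{s} \frac{1}{1 + \lambda_i} = \prod_{i=1}^{s} (1 - r_i^2),$$

as in (6.45), where $r_i^2 = \lambda_i/(1 + \lambda_i)$.

6.12 Show that the F-approximation based on $A_P$ in (6.50) reduces to (6.26) if
$A_P = V^{(s)}/s$, as in (6.49).

6.13 Show that if $s = 1$, $A_{LH}$ in (6.51) reduces to (6.43).

6.14 Show that the F-approximation denoted by $F_3$ in (6.31) is equivalent to (6.52).

6.15 Show that $\mathrm{cov}(\hat{\boldsymbol{\delta}}) = (\boldsymbol{\Sigma}/n) \sum_{i=1}^{k} c_i^2$ as in (6.61).

6.16 If $\mathbf{z}_{ij} = \mathbf{C}\mathbf{y}_{ij}$, where $\mathbf{C}$ is $(p - 1) \times p$, show that $\mathbf{H}_z = \mathbf{C}\mathbf{H}\mathbf{C}'$ and $\mathbf{E}_z = \mathbf{C}\mathbf{E}\mathbf{C}'$,
as used in (6.79).

6.17 Why do $\mathbf{C}$ and $\mathbf{C}'$ not “cancel out” of Wilks’ $\Lambda$ in (6.79)?

6.18 Show that under $H_{03}$ and $H_{01}$, $\mathbf{C}\bar{\mathbf{y}}_{..}$ is $N_{p-1}(\mathbf{0}, \mathbf{C}\boldsymbol{\Sigma}\mathbf{C}'/kn)$, as noted preceding
(6.84).

6.19 Show that $T^2 = kn(\mathbf{C}\bar{\mathbf{y}}_{..})'(\mathbf{C}\mathbf{E}\mathbf{C}'/\nu_E)^{-1}\mathbf{C}\bar{\mathbf{y}}_{..}$ in (6.84) is distributed as $T^2_{p-1,\nu_E}$.

6.20 For $\varepsilon$ defined by (6.89), show that $\varepsilon = 1$ when $\boldsymbol{\Sigma} = \sigma^2\mathbf{I}$.

6.21 Give a justification of the Wilks’ $\Lambda$ test of $H_0: \bar{\boldsymbol{\mu}} = \mathbf{0}$ in (6.104).

6.22 Provide an alternative derivation of (6.106), $\Lambda = \nu_E/(\nu_E + T^2)$, starting with
(6.105).

6.23 Obtain $T^2$ in terms of $\Lambda$ in (6.107), starting with (6.106).

6.24 Show that the rows of $\mathbf{C}_1$ are orthogonal to those of $\mathbf{B} = \mathbf{I} - \mathbf{C}_1'(\mathbf{C}_1\mathbf{C}_1')^{-1}\mathbf{C}_1$
in (6.110).

6.25 Show that $\hat{\boldsymbol{\beta}}$ in (6.115) minimizes $(\bar{\mathbf{y}} - \mathbf{A}\boldsymbol{\beta})'\mathbf{S}^{-1}(\bar{\mathbf{y}} - \mathbf{A}\boldsymbol{\beta})$.

6.26 Show that $T^2$ in (6.117) is equivalent to $T^2$ in (6.116).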

6.27 Baten, Tack, and Baeder (1958) compared judges’ scores on fish prepared by
three methods. Twelve fish were cooked by each method, and several judges
tasted fish samples and rated each on four variables: $y_1$ = aroma, $y_2$ = flavor,
$y_3$ = texture, and $y_4$ = moisture. The data are in Table 6.17. Each entry is an
average score for the judges on that fish.

(a) Compare the three methods using all four MANOVA tests.

(b) Compute the following measures of multivariate association from Section 6.1.8:
$\eta^2_\Lambda$, $\eta^2_\theta$, $A_\Lambda$, $A_{LH}$, $A_P$.

(c) Based on the eigenvalues, is the essential dimensionality of the space containing
the mean vectors equal to 1 or 2?

(d) Using contrasts, test the following two comparisons of methods: 1 and 2
vs. 3, and 1 vs. 2.

(e) If any of the four tests in (a) is significant, run an ANOVA F-test on each
$y_i$ and examine the discriminant function $z = \mathbf{a}'\mathbf{y}$ (Section 6.4).

(f) Test the significance of $y_3$ and $y_4$ adjusted for $y_1$ and $y_2$.

(g) Test the significance of each variable adjusted for the other three.


Table 6.17. Judges’ Scores on Fish Prepared by Three Methods

      Method 1               Method 2               Method 3
 y1   y2   y3   y4      y1   y2   y3   y4      y1   y2   y3   y4
5.4  6.0  6.3  6.7     5.0  5.3  5.3  6.5     4.8  5.0  6.5  7.0
5.2  6.2  6.0  5.8     4.8  4.9  4.2  5.6     5.4  5.0  6.0  6.4
6.1  5.9  6.0  7.0     3.9  4.0  4.4  5.0     4.9  5.1  5.9  6.5
4.8  5.0  4.9  5.0     4.0  5.1  4.8  5.8     5.7  5.2  6.4  6.4
5.0  5.7  5.0  6.5     5.6  5.4  5.1  6.2     4.2  4.6  5.3  6.3
5.7  6.1  6.0  6.6     6.0  5.5  5.7  6.0     6.0  5.3  5.8  6.4
6.0  6.0  5.8  6.0     5.2  4.8  5.4  6.0     5.1  5.2  6.2  6.5
4.0  5.0  4.0  5.0     5.3  5.1  5.8  6.4     4.8  4.6  5.7  5.7
5.7  5.4  4.9  5.0     5.9  6.1  5.7  6.0     5.3  5.4  6.8  6.6
5.6  5.2  5.4  5.8     6.1  6.0  6.1  6.2     4.6  4.4  5.7  5.6
5.8  6.1  5.2  6.4     6.2  5.7  5.9  6.0     4.5  4.0  5.0  5.9
5.3  5.9  5.8  6.0     5.1  4.9  5.3  4.8     4.4  4.2  5.6  5.5

Source: Baten, Tack, and Baeder (1958, p. 8).

6.28 Table 6.18, from Keuls, Martakis, and Magid (1984), gives data from a two-way
(fixed-effects) MANOVA on snap beans showing the results of four variables:
$y_1$ = yield earliness, $y_2$ = specific leaf area (SLA) earliness, $y_3$ =
total yield, and $y_4$ = average SLA. The factors are sowing date (S) and variety
(V).

(a) Test for main effects and interaction using all four MANOVA statistics.

(b) In previous experiments, the second variety gave higher yields. Compare
variety 2 with varieties 1 and 3 by means of a test on a contrast.

(c) Test linear, quadratic, and cubic contrasts for sowing date. (Interpretation
of these for mean vectors is not as straightforward as for univariate means.)

(d) If any of the tests in part (a) rejects $H_0$, carry out ANOVA F-tests on the
four variables.

(e) Test the significance of $y_3$ and $y_4$ adjusted for $y_1$ and $y_2$ in main effects
and interaction.

(f) Test the significance of each variable adjusted for the other three in main
effects and interaction.


Table 6.18. Snapbean Data

S  V       y1    y2    y3    y4       S  V       y1    y2    y3    y4
1  1  1  59.3   4.5  38.4   295       3  1  1  68.1   3.4  42.2   280
      2  60.3   4.5  38.6   302             2  68.0   2.9  42.4   284
      3  60.9   5.3  37.2   318             3  68.5   3.3  41.5   286
      4  60.6   5.8  38.1   345             4  68.6   3.1  41.9   284
      5  60.4   6.0  38.8   325             5  68.6   3.3  42.1   268
1  2  1  59.3   6.7  37.9   275       3  2  1  64.0   3.6  40.9   233
      2  59.4   4.8  36.6   290             2  63.4   3.9  41.4   248
      3  60.0   5.1  38.7   295             3  63.5   3.7  41.6   244
      4  58.9   5.8  37.5   296             4  63.4   3.7  41.4   266
      5  59.5   4.8  37.0   330             5  63.5   4.1  41.1   244
1  3  1  59.4   5.1  38.7   299       3  3  1  68.0   3.7  42.3   293
      2  60.2   5.3  37.0   315             2  68.7   3.5  41.6   284
      3  60.7   6.4  37.4   304             3  68.7   3.8  40.7   277
      4  60.5   7.1  37.0   302             4  68.4   3.5  42.0   299
      5  60.1   7.8  36.9   308             5  68.6   3.4  42.4   285
2  1  1  63.7   5.4  39.5   271       4  1  1  69.8   1.4  48.4   265
      2  64.1   5.4  39.2   284             2  69.5   1.3  47.8   247
      3  63.4   5.4  39.0   281             3  69.5   1.3  46.9   231
      4  63.2   5.3  39.0   291             4  69.9   1.3  47.5   268
      5  63.2   5.0  39.0   270             5  70.3   1.1  47.1   247
2  2  1  60.6   6.8  38.1   248       4  2  1  66.6   1.8  45.7   205
      2  61.0   6.5  38.6   264             2  66.5   1.7  46.8   239
      3  60.7   6.8  38.8   257             3  67.1   1.7  46.3   230
      4  60.6   7.1  38.6   260             4  65.8   1.8  46.3   235
      5  60.3   6.0  38.5   261             5  65.6   1.9  46.1   220
2  3  1  63.8   5.7  40.5   282       4  3  1  70.1   1.7  48.1   253
      2  63.2   6.1  40.2   284             2  72.3   0.7  47.8   249
      3  63.3   6.0  40.0   291             3  69.7   1.5  46.7   226
      4  63.2   5.9  40.0   299             4  69.9   1.3  47.1   248
      5  63.1   5.4  39.7   295             5  69.8   1.4  46.7

6.29  The  bar  steel  data  in  Table  6.6  were  analyzed  in  Example  6.5.2  as  a  two-way 
fixed-effects  design.  Consider  lubricants  to  be  random  so  that  we  have  a  mixed 
model.  Test  for  main  effects  and  interaction. 

6.30 In Table 6.19, we have a comparison of four reagents (Burdick 1979). The
first reagent is the one presently in use and the other three are less expensive
reagents that we wish to compare with the first. All four reagents are
used with a blood sample from each patient. The three variables measured
for each reagent are $y_1$ = white blood count, $y_2$ = red blood count, and
$y_3$ = hemoglobin count.

(a) Analyze as a randomized block design with subjects as blocks.

(b) Compare the first reagent with the other three using a contrast.


Table 6.19. Blood Data

             Reagent 1           Reagent 2           Reagent 3           Reagent 4
Subject    y1    y2    y3      y1    y2    y3      y1    y2    y3      y1    y2    y3
   1      8.0  3.96  12.5     8.0  3.93  12.7     7.9  3.86  13.0     7.9  3.87  13.2
   2      4.0  5.37  16.9     4.2  5.35  17.2     4.1  5.39  17.2     4.0  5.35  17.3
   3      6.3  5.47  17.1     6.3  5.39  17.5     6.0  5.39  17.2     6.1  5.41  17.4
   4      9.4  5.16  16.2     9.4  5.16  16.7     9.4  5.17  16.7     9.1  5.16  16.7
   5      8.2  5.16  17.0     8.0  5.13  17.5     8.1  5.10  17.4     7.8  5.12  17.5
   6     11.0  4.67  14.3    10.7  4.60  14.7    10.6  4.52  14.6    10.5  4.58  14.7
   7      6.8  5.20  16.2     6.8  5.16  16.7     6.9  5.13  16.8     6.7  5.19  16.8
   8      9.0  4.65  14.7     9.0  4.57  15.0     8.9  4.58  15.0     8.6  4.55  15.1
   9      6.1  5.22  16.3     6.0  5.16  16.9     6.1  5.14  16.9     6.0  5.21  16.9
  10      6.4  5.13  15.9     6.4  5.11  16.4     6.4  5.11  16.4     6.3  5.07  16.3
  11      5.6  4.47  13.3     5.5  4.45  13.6     5.3  4.46  13.6     5.3  4.44  13.7
  12      8.2  5.22  16.0     8.2  5.14  16.5     8.0  5.14  16.5     7.8  5.16  16.5
  13      5.7  5.10  14.9     5.6  5.05  15.3     5.5  5.02  15.4     5.4  5.05  15.5
  14      9.8  5.25  16.1     9.8  5.15  16.6     8.1  5.10  13.8     9.4  5.16  16.6
  15      5.9  5.28  15.8     5.8  5.25  16.4     5.7  5.26  16.4     5.6  5.29  16.2
  16      6.6  4.65  12.8     6.4  4.59  13.2     6.3  4.58  13.1     6.4  4.57  13.2
  17      5.7  4.42  14.5     5.5  4.31  14.9     5.5  4.30  14.9     5.4  4.32  14.8
  18      6.7  4.38  13.1     6.5  4.32  13.4     6.5  4.32  13.6     6.5  4.31  13.5
  19      6.8  4.67  15.6     6.6  4.57  15.8     6.5  4.55  16.0     6.5  4.56  15.9
  20      9.6  5.64  17.0     9.5  5.58  17.5     9.3  5.50  17.4     9.2  5.46  17.5


Table 6.20. Wear of Coated Fabrics in Three Periods (mg)

                                    Proportion of Filler
Surface                 P1 (25%)          P2 (50%)          P3 (75%)
Treatment   Filler    y1   y2   y3      y1   y2   y3      y1   y2   y3
   T0         F1     194  192  141     233  217  171     265  252  207
                     208  188  165     241  222  201     269  283  191
              F2     239  127   90     224  123   79     243  117  100
                     187  105   85     243  123  110     226  125   75
   T1         F1     155  169  151     198  187  176     235  225  166
                     173  152  141     177  196  167     229  270  183
              F2     137   82   77     129   94   78     155   76   92
                     160   82   83      98   89   48     132  105   67

6.31 The data in Table 6.20, from Box (1950), show the amount of fabric wear $y_1$,
$y_2$, and $y_3$ in three successive periods: (1) the first 1000 revolutions, (2) the second
1000 revolutions, and (3) the third 1000 revolutions of the abrasive wheel.
There were three factors: type of abrasive surface, type of filler, and proportion
of filler. There were two replications. Carry out a three-way MANOVA, testing
for main effects and interactions. (Ignore the repeated measures aspects of the
data.)

6.32 The fabric wear data in Table 6.20 can be considered to be a growth curve
model, with the three periods ($y_1$, $y_2$, $y_3$) representing repeated measurements
on the same specimen. We thus have one within-subjects factor, to which we
should assign polynomial contrasts (−1, 0, 1) and (−1, 2, −1), and a three-way
between-subjects classification. Test for period and the interaction of
period with the between-subjects factors and interactions.

6.33 Carry out a profile analysis on the fish data in Table 6.17, testing for parallelism,
equal levels, and flatness.

6.34 Rao (1948) measured the weight of cork borings taken from the north (N),
east (E), south (S), and west (W) directions of 28 trees. The data are given in
Table 6.21. It is of interest to compare the bark thickness (and hence weight)
in the four directions. This can be done by analyzing the data as a one-sample
repeated measures design. Since the primary comparison of interest is north
and south vs. east and west, use the contrast matrix

(a) Test $H_0: \mu_N = \mu_E = \mu_S = \mu_W$ using the entire matrix $\mathbf{C}$.

(b) If the test in (a) rejects $H_0$, test each row of $\mathbf{C}$.


Table 6.21. Weights of Cork Borings (cg) in Four Directions for 28 Trees

Tree    N    E    S    W       Tree    N    E    S    W
  1    72   66   76   77        15    91   79  100   75
  2    60   53   66   63        16    56   68   47   50
  3    56   57   64   58        17    79   65   70   61
  4    41   29   36   38        18    81   80   68   58
  5    32   32   35   36        19    78   55   67   60
  6    30   35   34   26        20    46   38   37   38
  7    39   39   31   27        21    39   35   34   37
  8    42   43   31   25        22    32   30   30   32
  9    37   40   31   25        23    60   50   67   54
 10    33   29   27   36        24    35   37   48   39
 11    32   30   34   28        25    39   36   39   31
 12    63   45   74   63        26    50   34   37   40
 13    54   46   60   52        27    43   37   39   50
 14    47   51   52   43        28    48   54   57   43

6.35  Analyze  the  glucose  data  in  Table  3.8  as  a  one-sample  repeated  measures 
design  with  two  within-subjects  factors.  Factor  A  is  a  comparison  of  fasting 
test vs. 1 hour posttest. The three levels of factor B are $y_1$ (and $x_1$), $y_2$ (and
$x_2$), and $y_3$ (and $x_3$).

6.36 Table 6.22 gives survival times for cancer patients (Cameron and Pauling 1978;
see also Andrews and Herzberg 1985, pp. 203-206). The factors in this two-way
design are gender (1 = male, 2 = female) and type of cancer (1 =
stomach, 2 = bronchus, 3 = colon, 4 = rectum, 5 = bladder, 6 = kidney).
The variables (repeated measures) are $y_1$ = survival time (days) of patient
treated with ascorbate measured from date of first hospital attendance, $y_2$ =
mean survival time for the patient’s 10 matched controls (untreated with ascorbate),
$y_3$ = survival time after ascorbate treatment ceased, and $y_4$ = mean
survival time after all treatment ceased for the patient’s 10 matched controls.
Analyze as a repeated measures design with one within-subjects factor ($y_1$, $y_2$,
$y_3$, $y_4$) and a two-way (unbalanced) design between subjects. Since the two-way
classification of subjects is unbalanced, you will need to use a program
that allows for this or delete some observations to achieve a balanced design.


Table 6.22. Survival Times for Cancer Patients

Type of
Cancer   Gender   Age     y1     y2     y3    y4
   1        2      61    124    264    124    38
   1        1      69     42     62     12    18
   1        2      62     25    149     19    36
   1        2      66     45     18     45    12
   1        1      63    412    180    257    64
   1        1      79     51    142     23    20
   1        1      76   1112     35    128    13
   1        1      54     46    299     46    51
   1        1      62    103     85     90    10
   1        1      46    146    361    123    52
   1        1      57    340    269    310    28
   1        2      59    396    130    359    55
   2        1      74     81     72     74    33
   2        1      74    461    134    423    18
   2        1      66     20     84     16    20
   2        1      52    450     98    450    58
   2        2      48    246     48     87    13
   2        2      64    166    142    115    49
   2        1      70     63    113     50    38
   2        1      77     64     90     50    24
   2        1      71    155     30    113    18
   2        1      39    151    260     38    34
   2        1      70    166    116    156    20
   2        1      70     37     87     27    27
   2        1      55    223     69    218    32
   2        1      74    138    100    138    27
   2        1      69     72    315     39    39
   2        1      73    245    188    231    65
   3        2      76    248    292    135    18
   3        2      58    377    492     50    30
   3        1      49    189    462    189    65
   3        1      69   1843    235   1267    17
   3        2      70    180    294    155    57
   3        2      68    537    144    534    16
   3        1      50    519    643    502    25
   3        2      74    455    301    126    21
   3        1      66    406    148     90    17
   3        2      76    365    641    365    42
   3        2      56    942    272    911    40
   3        2      74    372     37    366    28
   3        1      58    163    199    156    31
   3        2      60    101    154     99    28
   3        1      77     20    649     20    33
   3        1      38    283    162    274    80
   4        2      56    185    422     62    38
   4        2      75    479     82    226    10
   4        2      57    875    551    437    62
   4        1      56    115    140     85    13
   4        1      68    362    106    122    36
   4        1      54    241    645    198    80
   4        1      59   2175    407    759    64
   5        1      93   4288    464    260    29
   5        2      70   3658    694    305    22
   5        2      77     51    221     37    21
   5        2      72    278    490    109    16
   5        1      44    548    433     37    11
   6        2      71    205    332      8    91
   6        2      63    538    377     96    47
   6        2      51    203    147    190    35
   6        1      53    296    500     64    34
   6        1      57    870    299    260    19
   6        1      73    331    585    326    37
   6        1      69   1685   1056     46    15

6.37 Analyze the ramus bone data of Table 3.8 as a one-sample growth curve design.

(a) Using a matrix $\mathbf{C}$ of orthogonal polynomial contrasts, test the hypothesis
of overall equality of means, $H_0: \mathbf{C}\boldsymbol{\mu} = \mathbf{0}$.

(b) If the overall hypothesis in (a) is rejected, find the degree of growth curve
by testing each row of $\mathbf{C}$.


Table 6.23. Weights of 13 Male Mice Measured at Successive Intervals of 3 Days over 21
Days from Birth to Weaning

Mouse   Day 3   Day 6   Day 9   Day 12   Day 15   Day 18   Day 21
  1      .190    .388    .621    .823    1.078    1.132    1.191
  2      .218    .393    .568    .729     .839     .852    1.004
  3      .211    .394    .549    .700     .783     .870     .925
  4      .209    .419    .645    .850    1.001    1.026    1.069
  5      .193    .362    .520    .530     .641     .640     .751
  6      .201    .361    .502    .530     .657     .762     .888
  7      .202    .370    .498    .650     .795     .858     .910
  8      .190    .350    .510    .666     .819     .879     .929
  9      .219    .399    .578    .699     .709     .822     .953
 10      .225    .400    .545    .690     .796     .825     .836
 11      .224    .381    .577    .756     .869     .929     .999
 12      .187    .329    .441    .525     .589     .621     .796
 13      .278    .471    .606    .770     .888    1.001    1.105

6.38  Table  6.23  contains  the  weights  of  13  male  mice  measured  every  3  days  from 
birth  to  weaning.  The  data  set  was  reported  and  analyzed  by  Williams  and 
Izenman  (1981)  and  by  Izenman  and  Williams  (1989)  and  has  been  further 
analyzed  by  Rao  (1984,  1987)  and  by  Lee  (1988).  Analyze  as  a  one-sample 
growth  curve  design. 

(a) Using a matrix $\mathbf{C}$ of orthogonal polynomial contrasts, test the hypothesis
of overall equality of means, $H_0: \mathbf{C}\boldsymbol{\mu} = \mathbf{0}$.

(b)  If  the  overall  hypothesis  in  (a)  is  rejected,  find  the  degree  of  growth  curve 
by  testing  each  row  of  C. 
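
For equally spaced time points such as the seven in Table 6.23, one way to obtain a matrix of orthogonal polynomial contrasts is a QR decomposition of the powers of the time values. The Python sketch below (using NumPy) is only an illustration under stated assumptions: the helper names `orthopoly_contrasts` and `t2_contrast` are hypothetical, the rows are scaled to unit length rather than to integer coefficients, and the statistic is taken to be the usual one-sample form $T^2 = n(\mathbf{C}\bar{\mathbf{y}})'(\mathbf{C}\mathbf{S}\mathbf{C}')^{-1}(\mathbf{C}\bar{\mathbf{y}})$.

```python
import numpy as np

def orthopoly_contrasts(times):
    """Rows are orthonormal linear, quadratic, ... contrasts for the given times."""
    t = np.asarray(times, dtype=float)
    p = len(t)
    # Columns 1, t, t^2, ..., t^(p-1); QR orthogonalizes them in that order
    V = np.vander(t, N=p, increasing=True)
    Q, _ = np.linalg.qr(V)
    # Drop the constant column; the remaining columns are orthogonal to it
    return Q[:, 1:].T

def t2_contrast(Y, C):
    """Assumed one-sample statistic n (C ybar)' (C S C')^{-1} (C ybar)."""
    Y = np.asarray(Y, dtype=float)
    n = Y.shape[0]
    ybar = Y.mean(axis=0)
    S = np.cov(Y, rowvar=False)       # unbiased sample covariance matrix
    Cy = C @ ybar
    return float(n * Cy @ np.linalg.solve(C @ S @ C.T, Cy))

# Contrast matrix for measurements on days 3, 6, ..., 21
C = orthopoly_contrasts([3, 6, 9, 12, 15, 18, 21])
print(C.shape)
```

The signs of the rows returned by the decomposition are arbitrary; this does not affect $T^2$, which is unchanged when a row of $\mathbf{C}$ is multiplied by −1. A single row can be tested by passing it as a 1 × p matrix, for example `C[[0], :]` for the linear contrast.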

6.39  In  Table  6.24,  we  have  measurements  of  proportions  of  albumin  at  four  time 
points  on  three  groups  of  trout  (Beauchamp  and  Hoel  1974). 

(a) Using a matrix $\mathbf{C}$ of orthogonal contrasts, test the hypothesis of overall
equality of means, $H_0: \mathbf{C}\bar{\boldsymbol{\mu}}_{.} = \mathbf{0}$, for the combined samples, as in
Section 6.10.2.

(b)  If  the  overall  hypothesis  is  rejected,  find  the  degree  of  growth  curve  for  the 
combined  samples  by  testing  each  row  of  C. 

(c)  Compare  the  three  groups  using  the  entire  matrix  C. 

(d)  Compare  the  three  groups  using  each  row  of  C. 


Table 6.24. Measurements of Trout

               Time Point
Group      1       2       3       4
  1      .257    .288    .328    .358
  1      .266    .282    .315    .464
  1      .256    .303    .293    .261
  1      .272    .456    .288    .261
  2      .312    .300    .273    .253
  2      .253    .220    .314    .261
  2      .239    .261    .279    .224
  2      .254    .243    .304    .254
  3      .272    .279    .259    .295
  3      .246    .292    .279    .302
  3      .262    .311    .263    .264
  3      .292    .261    .314    .244

6.40 Table 6.25 contains weight gains for three groups of rats (Box 1950).

The variables are $y_i$ = gain in $i$th week, $i$ = 1, 2, 3, 4.

The groups are 1 = controls, 2 = thyroxin added to drinking water, and 3 =
thiouracil added to drinking water.

(a) Using a matrix $\mathbf{C}$ of orthogonal contrasts, test the hypothesis of overall
equality of means, $H_0: \mathbf{C}\bar{\boldsymbol{\mu}}_{.} = \mathbf{0}$, for the combined samples, as in
Section 6.10.2.

(b) If the overall hypothesis is rejected, find the degree of growth curve for the
combined samples by testing each row of $\mathbf{C}$.

(c) Compare the three groups using the entire matrix $\mathbf{C}$.

(d) Compare the three groups using each row of $\mathbf{C}$.


Table 6.25. Weekly Gains in Weight for 27 Rats

        Group 1                  Group 2                  Group 3
Rat   y1   y2   y3   y4    Rat   y1   y2   y3   y4    Rat   y1   y2   y3   y4
  1   29   28   25   33     11   26   36   35   35     18   25   23   11    9
  2   33   30   23   31     12   17   19   20   28     19   21   21   10   11
  3   25   34   33   41     13   19   33   43   38     20   26   21    6   27
  4   18   33   29   35     14   26   31   32   29     21   29   12   11   11
  5   25   23   17   30     15   15   25   23   24     22   24   26   22   17
  6   24   32   29   22     16   21   24   19   24     23   24   17    8   19
  7   20   23   16   31     17   18   35   33   33     24   22   17    8    5
  8   28   21   18   24                                25   11   24   21   24
  9   18   23   22   28                                26   15   17   12   17
 10   25   28   29   30                                27   19   17   15   18

6.41  Table  6.26  contains  measurements  of  coronary  sinus  potassium  at  2-min  inter¬ 
vals  after  coronary  occlusion  on  four  groups  of  dogs  (Grizzle  and  Allen  1969). 
The  groups  are  1  =  control  dogs,  2  =  dogs  with  extrinsic  cardiac  denervation 
3  wk  prior  to  coronary  occlusion,  3  =  dogs  with  extrinsic  cardiac  denervation 
immediately  prior  to  coronary  occlusion,  and  4  =  dogs  with  bilateral  thoracic 
sympathectomy  and  stellectomy  3  wk  prior  to  coronary  occlusion. 


Table 6.26. Coronary Sinus Potassium Measured at
2-min Intervals on Dogs

                         Time
Group     1     3     5     7     9    11    13
  1     4.0   4.0   4.1   3.6   3.6   3.8   3.1
  1     4.2   4.3   3.7   3.7   4.8   5.0   5.2
  1     4.3   4.2   4.3   4.3   4.5   5.8   5.4
  1     4.2   4.4   4.6   4.9   5.3   5.6   4.9
  1     4.6   4.4   5.3   5.6   5.9   5.9   5.3
  1     3.1   3.6   4.9   5.2   5.3   4.2   4.1
  1     3.7   3.9   3.9   4.8   5.2   5.4   4.2
  1     4.3   4.2   4.4   5.2   5.6   5.4   4.7
  1     4.6   4.6   4.4   4.6   5.4   5.9   5.6
  2     3.4   3.4   3.5   3.1   3.1   3.7   3.3
  2     3.0   3.1   3.2   3.0   3.3   3.0   3.0
  2     3.0   3.2   3.0   3.0   3.1   3.2   3.1
  2     3.1   3.2   3.2   3.2   3.3   3.1   3.1
  2     3.8   3.9   4.0   2.9   3.5   3.5   3.4
  2     3.0   3.6   3.2   3.1   3.0   3.0   3.0
  2     3.3   3.3   3.3   3.4   3.6   3.1   3.1
  2     4.2   4.0   4.2   4.1   4.2   4.0   4.0
  2     4.1   4.2   4.3   4.3   4.2   4.0   4.2
  2     4.5   4.4   4.3   4.5   5.3   4.4   4.4
  3     3.2   3.3   3.8   3.8   4.4   4.2   3.7
  3     3.3   3.4   3.4   3.7   3.7   3.6   3.7
  3     3.1   3.3   3.2   3.1   3.2   3.1   3.1
  3     3.6   3.4   3.5   4.6   4.9   5.2   4.4
  3     4.5   4.5   5.4   5.7   4.9   4.0   4.0
  3     3.7   4.0   4.4   4.2   4.6   4.8   5.4
  3     3.5   3.9   5.8   5.4   4.9   5.3   5.6
  3     3.9   4.0   4.1   5.0   5.4   4.4   3.9
  4     3.1   3.5   3.5   3.2   3.0   3.0   3.2
  4     3.3   3.2   3.6   3.7   3.7   4.2   4.4
  4     3.5   3.9   4.7   4.3   3.9   3.4   3.5
  4     3.4   3.4   3.5   3.3   3.4   3.2   3.4
  4     3.7   3.8   4.2   4.3   3.6   3.8   3.7
  4     4.0   4.6   4.8   4.9   5.4   5.6   4.8
  4     4.2   3.9   4.5   4.7   3.9   3.8   3.7
  4     4.1   4.1   3.7   4.0   4.1   4.6   4.7
  4     3.5   3.6   3.6   4.2   4.8   4.9   5.0

(a) Using a matrix $\mathbf{C}$ of orthogonal contrasts, test the hypothesis of overall
equality of means, $H_0: \mathbf{C}\bar{\boldsymbol{\mu}}_{.} = \mathbf{0}$, for the combined samples, as in
Section 6.10.2.

(b)  If  the  overall  hypothesis  is  rejected,  find  the  degree  of  growth  curve  for  the 
combined  samples  by  testing  each  row  of  C. 

(c)  Compare  the  four  groups  using  the  entire  matrix  C. 

(d)  Compare  the  four  groups  using  each  row  of  C. 

6.42  Table  6.27  contains  blood  pressure  measurements  at  intervals  after  inducing  a 
heart  attack  for  four  groups  of  rats:  group  1  is  the  controls  and  groups  2-4  have 
been  exposed  to  halothane  concentrations  of  .25%,  .50%,  1.0%,  respectively 
(Crepeau  et  al.  1985). 

(a) Find the degree of growth curve for the combined sample using the methods
in (6.113)–(6.118).

(b) Repeat (a) for group 1.

(c) Repeat (a) for groups 2–4 combined.


Table 6.27. Blood Pressure Data

                 Number of Minutes after Ligation
Group       1       5      10      15      30      60
  1     112.5   100.5   102.5   102.5   107.5   107.5
  1      92.5   102.5   105.0   100.0   110.0   117.5
  1     132.5   125.0   115.0   112.5   110.0   110.0
  1     102.5   107.5   107.5   102.5    90.0   112.5
  1     110.0   130.0   115.0   105.0   112.5   110.0
  1      97.5    97.5    80.0    82.5    82.5   102.5
  1      90.0    70.0    85.0    85.0    92.5    97.5
  2     115.0   115.0   107.5   107.5   112.5   107.5
  2     125.0   125.0   120.0   120.0   117.5   125.0
  2      95.0    90.0    95.0    90.0   100.0   107.5
  2      87.5    65.5    85.0    90.0   105.0    90.0
  2      90.0    87.5    97.5    95.0   100.0    95.0
  2      97.5    92.5    57.5    55.0    90.0    97.5
  2     107.5   107.5   145.0   110.0   105.0   112.5
  2     102.5   130.0    85.0    80.0   127.5    97.5
  3     107.5   107.5   102.5   102.5   102.5    97.5
  3      97.5   108.5    94.5   102.5   102.5   107.5
  3     100.0   105.0   105.0   105.0   110.0   110.0
  3      95.0    95.0    90.0   100.0   100.0   100.0
  3      85.0    92.5    92.5    92.5    90.0   110.0
  3      82.5    77.5    75.0    65.5    65.0    72.5
  3      62.5    75.0   115.0   110.0   100.0   100.0
  4      70.0    67.5    67.5    77.5    77.5    77.5
  4      45.0    37.5    45.0    45.0    47.5    45.0
  4      52.5    22.5    90.0    65.0    60.0    65.5
  4     100.0   100.0   100.0   100.0    97.5    92.5
  4     115.0   110.0   100.0   110.0   105.0   105.0
  4      97.5    97.5    97.5   105.0    95.0    92.5
  4      95.0   125.0   130.0   125.0   115.0   117.5
  4      72.5    87.5    65.0    57.5    92.5    82.5
  4     105.0   105.0   105.0   105.0   102.5   100.0

6.43 Table 6.28, from Zerbe (1979a), compares 13 control and 20 obese patients on
a glucose tolerance test using plasma inorganic phosphate. Delete the observations
corresponding to 1/2 and 1 1/2 hours so that the time points are equally
spaced.

(a) For the control group, use orthogonal polynomials to find the degree of
growth curve.

(b) Repeat (a) for the obese group.


Table 6.28. Plasma Inorganic Phosphate (mg/dl)

                   Hours after Glucose Challenge
Patient     0    1/2     1   1 1/2    2     3     4     5

                            Control
   1      4.3   3.3   3.0   2.6   2.2   2.5   3.4   4.4^a
   2      3.7   2.6   2.6   1.9   2.9   3.2   3.1   3.9
   3      4.0   4.1   3.1   2.3   2.9   3.1   3.9   4.0
   4      3.6   3.0   2.2   2.8   2.9   3.9   3.8   4.0
   5      4.1   3.8   2.1   3.0   3.6   3.4   3.6   3.7
   6      3.8   2.2   2.0   2.6   3.8   3.6   3.0   3.5
   7      3.8   3.0   2.4   2.5   3.1   3.4   3.5   3.7
   8      4.4   3.9   2.8   2.1   3.6   3.8   4.0   3.9
   9      5.0   4.0   3.4   3.4   3.3   3.6   4.0   4.3
  10      3.7   3.1   2.9   2.2   1.5   2.3   2.7   2.8
  11      3.7   2.6   2.6   2.3   2.9   2.2   3.1   3.9
  12      4.4   3.7   3.1   3.2   3.7   4.3   3.9   4.8
  13      4.7   3.1   3.2   3.3   3.2   4.2   3.7   4.3

                             Obese
   1      4.3   3.3   3.0   2.6   2.2   2.5   2.4   3.4^a
   2      5.0   4.9   4.1   3.7   3.7   4.1   4.7   4.9
   3      4.6   4.4   3.9   3.9   3.7   4.2   4.8   5.0
   4      4.3   3.9   3.1   3.1   3.1   3.1   3.6   4.0
   5      3.1   3.1   3.3   2.6   2.6   1.9   2.3   2.7
   6      4.8   5.0   2.9   2.8   2.2   3.1   3.5   3.6
   7      3.7   3.1   3.3   2.8   2.9   3.6   4.3   4.4
   8      5.4   4.7   3.9   4.1   2.8   3.7   3.5   3.7
   9      3.0   2.5   2.3   2.2   2.1   2.6   3.2   3.5
  10      4.9   5.0   4.1   3.7   3.7   4.1   4.7   4.9
  11      4.8   4.3   4.7   4.6   4.7   3.7   3.6   3.9
  12      4.4   4.2   4.2   3.4   3.5   3.4   3.9   4.0
  13      4.9   4.3   4.0   4.0   3.3   4.1   4.2   4.3
  14      5.1   4.1   4.6   4.1   3.4   4.2   4.4   4.9
  15      4.8   4.6   4.6   4.4   4.1   4.0   3.8   3.8
  16      4.2   3.5   3.8   3.6   3.3   3.1   3.5   3.9
  17      6.6   6.1   5.2   4.1   4.3   3.8   4.2   4.8
  18      3.6   3.4   3.1   2.8   2.1   2.4   2.5   3.5
  19      4.5   4.0   3.7   3.3   2.4   2.3   3.1   3.3
  20      4.6   4.4   3.8   3.8   3.8   3.6   3.8   3.8

^a The similarity in the data for patient 1 in the control group and patient 1 in the obese group is coincidental.




Table 6.29. Mandible Measurements

                               Activator Treatment
                    1                      2                      3
Subject     y1     y2     y3       y1     y2     y3       y1     y2     y3

Group 1
   1      117.0  117.5  118.5     59.0   59.0   60.0     10.5   16.5   16.5
   2      109.0  110.5  111.0     60.0   61.5   61.5     30.5   30.5   30.5
   3      117.0  120.0  120.5     60.0   61.5   62.0     23.5   23.5   23.5
   4      122.0  126.0  127.0     67.5   70.5   71.5     33.0   32.0   32.5
   5      116.0  118.5  119.5     61.5   62.5   63.5     24.5   24.5   24.5
   6      123.0  126.0  127.0     65.5   61.5   67.5     22.0   22.0   22.0
   7      130.5  132.0  134.5     68.5   69.5   71.0     33.0   32.5   32.0
   8      126.5  128.5  130.5     69.0   71.0   73.0     20.0   20.0   20.0
   9      113.0  116.5  118.0     58.0   59.0   60.5     25.0   25.0   24.5

Group 2
   1      128.0  129.0  131.5     67.0   67.5   69.0     24.0   24.0   24.0
   2      116.5  120.0  121.5     63.5   65.0   66.0     28.5   29.5   29.5
   3      121.5  125.5  127.0     64.5   67.5   69.0     26.5   27.0   27.0
   4      109.5  112.0  114.0     54.0   55.5   57.0     18.0   18.5   19.0
   5      133.0  136.0  137.5     72.0   73.5   75.5     34.5   34.5   34.5
   6      120.0  124.5  126.0     62.5   65.0   66.0     26.0   26.0   26.0
   7      129.5  133.5  134.5     65.0   68.0   69.0     18.5   18.5   18.5
   8      122.0  124.0  125.5     64.5   65.5   66.0     18.5   18.5   18.5
   9      125.0  127.0  128.0     65.5   66.5   67.0     21.5   21.5   21.6

(c)  Find  the  degree  of  growth  curve  for  the  combined  groups,  and  compare  the 
growth  curves  of  the  two  groups. 

6.44 Consider the complete data from Table 6.28 including the observations
corresponding to $\tfrac{1}{2}$ and $1\tfrac{1}{2}$ hours. Use the methods in (6.113)-(6.118) for unequally
spaced time points to analyze each group separately and the combined groups.

6.45 Table 6.29 contains mandible measurements (Timm 1980). There were two
groups of subjects. Each subject was measured at three time points $y_1$, $y_2$, and
$y_3$ for each of three types of activator treatment. Analyze as a repeated measures
design with two within-subjects factors and one between-subjects factor.
Use linear and quadratic contrasts for time (growth curve).


CHAPTER  7 


Tests  on  Covariance  Matrices 


7.1  INTRODUCTION 

We now consider tests of hypotheses involving the variance-covariance structure.
These tests are often carried out to check assumptions pertaining to other tests. In
Sections 7.2-7.4, we cover three basic types of hypotheses: (1) the covariance matrix
has a particular structure, (2) two or more covariance matrices are equal, and (3) certain
elements of the covariance matrix are zero, thus implying independence of the
corresponding (multivariate normal) random variables. In most cases we use the likelihood
ratio approach (Section 5.4.3). The resulting test statistics often involve the
ratio of the determinants of the sample covariance matrix under the null hypothesis
and under the alternative hypothesis.


7.2  TESTING A SPECIFIED PATTERN FOR $\boldsymbol{\Sigma}$

In this section, the discussion is in terms of a sample covariance matrix $\mathbf{S}$ from a
single sample. However, the tests can be applied to a sample covariance matrix
$\mathbf{S}_{pl} = \mathbf{E}/\nu_E$ obtained by pooling across several samples. To allow for either possibility, the
degrees-of-freedom parameter has been indicated by $\nu$. For a single sample, $\nu = n - 1$;
for a pooled covariance matrix, $\nu = \sum_{i=1}^{k}(n_i - 1) = \sum_{i=1}^{k} n_i - k = N - k$.


7.2.1  Testing $H_0: \boldsymbol{\Sigma} = \boldsymbol{\Sigma}_0$

We begin with the basic hypothesis $H_0: \boldsymbol{\Sigma} = \boldsymbol{\Sigma}_0$ vs. $H_1: \boldsymbol{\Sigma} \neq \boldsymbol{\Sigma}_0$. The hypothesized
covariance matrix $\boldsymbol{\Sigma}_0$ is a target value for $\boldsymbol{\Sigma}$ or a nominal value from previous
experience. Note that $\boldsymbol{\Sigma}_0$ is completely specified in $H_0$, whereas $\boldsymbol{\mu}$ is not specified.

To test $H_0$, we obtain a random sample of $n$ observation vectors $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n$
from $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ and calculate $\mathbf{S}$. To see if $\mathbf{S}$ is significantly different from $\boldsymbol{\Sigma}_0$, we
use the following test statistic, which is a modification of the likelihood ratio (Section 5.4.3):

$$u = \nu\left[\ln|\boldsymbol{\Sigma}_0| - \ln|\mathbf{S}| + \operatorname{tr}(\mathbf{S}\boldsymbol{\Sigma}_0^{-1}) - p\right], \tag{7.1}$$

where $\nu$ represents the degrees of freedom of $\mathbf{S}$ (see comments at the beginning of
Section 7.2), ln is the natural logarithm (base $e$), and tr is the trace of a matrix (Section 2.9).
Note that if $\mathbf{S} = \boldsymbol{\Sigma}_0$, then $u = 0$; otherwise $u$ increases with the "distance"
between $\mathbf{S}$ and $\boldsymbol{\Sigma}_0$ [see (7.4) and the comment following].

When $\nu$ is large, the statistic $u$ in (7.1) is approximately distributed as $\chi^2[\tfrac{1}{2}p(p+1)]$
if $H_0$ is true. For moderate size $\nu$,

$$u' = \left[1 - \frac{1}{6\nu - 1}\left(2p + 1 - \frac{2}{p+1}\right)\right]u \tag{7.2}$$

is a better approximation to the $\chi^2[\tfrac{1}{2}p(p+1)]$ distribution. We reject $H_0$ if $u$ or $u'$ is
greater than $\chi^2[\alpha, \tfrac{1}{2}p(p+1)]$. Note that the degrees of freedom for the $\chi^2$-statistic,
$\tfrac{1}{2}p(p+1)$, is the number of distinct parameters in $\boldsymbol{\Sigma}$.

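We can express $u$ in terms of the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_p$ of $\mathbf{S}\boldsymbol{\Sigma}_0^{-1}$ by noting
that $\operatorname{tr}(\mathbf{S}\boldsymbol{\Sigma}_0^{-1})$ and $\ln|\boldsymbol{\Sigma}_0| - \ln|\mathbf{S}|$ become

$$\operatorname{tr}(\mathbf{S}\boldsymbol{\Sigma}_0^{-1}) = \sum_{i=1}^{p}\lambda_i \quad \text{[by (2.107)]},$$

$$\begin{aligned}
\ln|\boldsymbol{\Sigma}_0| - \ln|\mathbf{S}| &= -\ln|\boldsymbol{\Sigma}_0^{-1}| - \ln|\mathbf{S}| \\
&= -\ln|\mathbf{S}\boldsymbol{\Sigma}_0^{-1}| \quad \text{[by (2.89) and (2.91)]} \\
&= -\ln\left(\prod_{i=1}^{p}\lambda_i\right) \quad \text{[by (2.108)]},
\end{aligned} \tag{7.3}$$

from which (7.1) can be written as

$$u = \nu\left[-\ln\left(\prod_{i=1}^{p}\lambda_i\right) + \sum_{i=1}^{p}\lambda_i - p\right] = \nu\left[\sum_{i=1}^{p}(\lambda_i - \ln\lambda_i) - p\right]. \tag{7.4}$$

A plot of $y = x - \ln x$ will show that $x - \ln x \geq 1$ for all $x > 0$, with equality
holding only for $x = 1$. Thus $\sum_{i=1}^{p}(\lambda_i - \ln\lambda_i) \geq p$ and $u \geq 0$.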

The hypothesis that the variables are independent and have unit variance,

$$H_0: \boldsymbol{\Sigma} = \mathbf{I},$$

can be tested by simply setting $\boldsymbol{\Sigma}_0 = \mathbf{I}$ in (7.1).
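
The statistic in (7.1) and its correction in (7.2) can be sketched in a few lines of Python. This is a minimal illustration, assuming NumPy is available; the function name `sigma0_statistics` and the 2 x 2 matrix are hypothetical and are not data from the text.

```python
import numpy as np


def sigma0_statistics(S, Sigma0, nu):
    """Return (u, u_prime, df) for H0: Sigma = Sigma0, following (7.1) and (7.2)."""
    S = np.asarray(S, dtype=float)
    Sigma0 = np.asarray(Sigma0, dtype=float)
    p = S.shape[0]
    # ln|Sigma0| and ln|S| via slogdet, which is more stable than log(det(...))
    _, logdet_sigma0 = np.linalg.slogdet(Sigma0)
    _, logdet_s = np.linalg.slogdet(S)
    # tr(S Sigma0^{-1}) equals tr(Sigma0^{-1} S); solve() avoids an explicit inverse
    trace_term = np.trace(np.linalg.solve(Sigma0, S))
    u = nu * (logdet_sigma0 - logdet_s + trace_term - p)              # (7.1)
    u_prime = (1 - (2 * p + 1 - 2 / (p + 1)) / (6 * nu - 1)) * u      # (7.2)
    df = p * (p + 1) // 2                                             # distinct parameters in Sigma
    return u, u_prime, df


# Hypothetical illustration: test H0: Sigma = I with nu = 10
S_demo = np.array([[2.0, 0.0], [0.0, 0.5]])
u_demo, u_prime_demo, df_demo = sigma0_statistics(S_demo, np.eye(2), nu=10)

# When S equals Sigma0 exactly, u is zero
u_null, _, _ = sigma0_statistics(np.eye(2), np.eye(2), nu=10)
```

Either `u_demo` or `u_prime_demo` would then be compared with the upper $\alpha$ point of the chi-square distribution with `df_demo` degrees of freedom, taken from a table or a statistics library.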

7.2.2  Testing Sphericity

The hypothesis that the variables $y_1, y_2, \ldots, y_p$ in $\mathbf{y}$ are independent and have the
same variance can be expressed as $H_0: \boldsymbol{\Sigma} = \sigma^2\mathbf{I}$ versus $H_1: \boldsymbol{\Sigma} \neq \sigma^2\mathbf{I}$, where $\sigma^2$ is
the unknown common variance. This hypothesis is of interest in repeated measures
(see Section 6.9.1). Under $H_0$, the ellipsoid $(\mathbf{y} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{y} - \boldsymbol{\mu}) = c^2$ reduces to
$(\mathbf{y} - \boldsymbol{\mu})'(\mathbf{y} - \boldsymbol{\mu}) = \sigma^2c^2$, the equation of a sphere; hence the term sphericity is applied
to the covariance structure $\boldsymbol{\Sigma} = \sigma^2\mathbf{I}$. Another sphericity hypothesis of interest in
repeated measures is $H_0: \mathbf{C}\boldsymbol{\Sigma}\mathbf{C}' = \sigma^2\mathbf{I}$, where $\mathbf{C}$ is any full-rank $(p - 1) \times p$ matrix
of orthonormal contrasts (see Section 6.9.1).

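For a random sample $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n$ from $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, the likelihood ratio for
testing $H_0: \boldsymbol{\Sigma} = \sigma^2\mathbf{I}$ is

$$\mathrm{LR} = \left[\frac{|\mathbf{S}|}{(\operatorname{tr}\mathbf{S}/p)^p}\right]^{n/2}. \tag{7.5}$$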

In some cases that we have considered previously, the likelihood ratio is a simple
function of a test statistic such as $F$, $T^2$, Wilks' $\Lambda$, and so on. However, LR in (7.5)
does not reduce to a standard statistic, and we resort to an approximation for its
distribution. It has been shown that for a general likelihood ratio statistic LR,

$$-2\ln(\mathrm{LR}) \text{ is approximately } \chi^2_{\nu} \tag{7.6}$$

for large $n$, where $\nu$ is the total number of parameters minus the number estimated
under the restrictions imposed by $H_0$.

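For the likelihood ratio statistic in (7.5), we obtain

$$-2\ln(\mathrm{LR}) = -n\ln\left[\frac{|\mathbf{S}|}{(\operatorname{tr}\mathbf{S}/p)^p}\right] = -n\ln u,$$

where

$$u = (\mathrm{LR})^{2/n} = \frac{p^p|\mathbf{S}|}{(\operatorname{tr}\mathbf{S})^p}. \tag{7.7}$$

By (2.107) and (2.108), $u$ becomes

$$u = \frac{p^p\prod_{i=1}^{p}\lambda_i}{\left(\sum_{i=1}^{p}\lambda_i\right)^p}, \tag{7.8}$$

where $\lambda_1, \lambda_2, \ldots, \lambda_p$ are the eigenvalues of $\mathbf{S}$. An improvement over $-n\ln u$ is
given by

$$u' = -\left(\nu - \frac{2p^2 + p + 2}{6p}\right)\ln u, \tag{7.9}$$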
where $\nu$ is the degrees of freedom for $\mathbf{S}$ (see comments at the beginning of Section
7.2). The statistic $u'$ has an approximate $\chi^2$-distribution with $\tfrac{1}{2}p(p+1) - 1$ degrees of
freedom. We reject $H_0$ if $u' > \chi^2[\alpha, \tfrac{1}{2}p(p+1) - 1]$. As noted before, the degrees of
freedom in the $\chi^2$-approximation is equal to the total number of parameters minus
the number of parameters estimated under $H_0$. The number of parameters in $\boldsymbol{\Sigma}$ is
$p + \binom{p}{2} = \tfrac{1}{2}p(p+1)$, and the loss of 1 degree of freedom is due to estimation of $\sigma^2$.

We see from (7.8) and (7.9) that if the sample $\lambda_i$'s are all equal, $u = 1$ and
$u' = 0$. Hence, this statistic also tests the hypothesis of equality of the population
eigenvalues.

To test $H_0: \mathbf{C}\boldsymbol{\Sigma}\mathbf{C}' = \sigma^2\mathbf{I}$, use $\mathbf{C}\mathbf{S}\mathbf{C}'$ in place of $\mathbf{S}$ in (7.7) and use $p - 1$ in place
of $p$ in (7.7)-(7.9) and in the degrees of freedom for $\chi^2$.

The likelihood ratio (7.5) was first given by Mauchly (1940), and his name is often
associated with this test. Nagarsenker and Pillai (1973) gave the exact distribution of
$u$ and provided a table for $p = 4, 5, \ldots, 10$. Venables (1976) showed that $u$ can be
obtained by a union-intersection approach (Section 6.1.4).


Example 7.2.2. We use the probe word data in Table 3.5 to illustrate tests of sphericity.
The five variables appear to be commensurate, and the hypothesis
$H_0: \mu_1 = \mu_2 = \cdots = \mu_5$ may be of interest. We would expect the variables to be correlated,
and $H_0$ would ordinarily be tested using a multivariate approach, as in
Sections 5.9.1 and 6.9.2. However, if $\boldsymbol{\Sigma} = \sigma^2\mathbf{I}$ or $\mathbf{C}\boldsymbol{\Sigma}\mathbf{C}' = \sigma^2\mathbf{I}$, then the hypothesis
$H_0: \mu_1 = \mu_2 = \cdots = \mu_5$ can be tested with a univariate ANOVA $F$-test (see
Section 6.9.1).

We first test $H_0: \boldsymbol{\Sigma} = \sigma^2\mathbf{I}$. The sample covariance matrix $\mathbf{S}$ was obtained in
Example 3.9.1. By (7.7),

$$u = \frac{p^p|\mathbf{S}|}{(\operatorname{tr}\mathbf{S})^p} = \frac{5^5(27{,}236{,}586)}{(292.891)^5} = .0395.$$

Then by (7.9), with $n = 11$ (so that $\nu = n - 1 = 10$) and $p = 5$, we have

$$u' = -\left(\nu - \frac{2p^2 + p + 2}{6p}\right)\ln u = 26.177.$$

The approximate $\chi^2$-test has $\tfrac{1}{2}p(p+1) - 1 = 14$ degrees of freedom. We therefore
compare $u' = 26.177$ with $\chi^2_{.05,14} = 23.68$ and reject $H_0: \boldsymbol{\Sigma} = \sigma^2\mathbf{I}$.

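To test $H_0: \mathbf{C}\boldsymbol{\Sigma}\mathbf{C}' = \sigma^2\mathbf{I}$, we use the following matrix of orthonormalized
contrasts:

$$\mathbf{C} = \begin{pmatrix}
4/\sqrt{20} & -1/\sqrt{20} & -1/\sqrt{20} & -1/\sqrt{20} & -1/\sqrt{20} \\
0 & 3/\sqrt{12} & -1/\sqrt{12} & -1/\sqrt{12} & -1/\sqrt{12} \\
0 & 0 & 2/\sqrt{6} & -1/\sqrt{6} & -1/\sqrt{6} \\
0 & 0 & 0 & 1/\sqrt{2} & -1/\sqrt{2}
\end{pmatrix}.$$

Then using $\mathbf{C}\mathbf{S}\mathbf{C}'$ in place of $\mathbf{S}$ and with $p - 1 = 4$ for the four rows of $\mathbf{C}$, we obtain

$$u = \frac{(p-1)^{p-1}|\mathbf{C}\mathbf{S}\mathbf{C}'|}{[\operatorname{tr}(\mathbf{C}\mathbf{S}\mathbf{C}')]^{p-1}} = \frac{4^4(144039.8)}{(93.6)^4} = .480,$$

$$u' = 6.170.$$

For degrees of freedom, we now have $\tfrac{1}{2}(4)(5) - 1 = 9$, and the critical value is
$\chi^2_{.05,9} = 16.92$. Hence, we do not reject $H_0: \mathbf{C}\boldsymbol{\Sigma}\mathbf{C}' = \sigma^2\mathbf{I}$, and a univariate $F$-test
of $H_0: \mu_1 = \mu_2 = \cdots = \mu_5$ may be justified. □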
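
The arithmetic in Example 7.2.2 can be checked with a short Python sketch. It takes $|\mathbf{S}|$, $\operatorname{tr}\mathbf{S}$, $|\mathbf{C}\mathbf{S}\mathbf{C}'|$, and $\operatorname{tr}(\mathbf{C}\mathbf{S}\mathbf{C}')$ as quoted in the example rather than recomputing them from the data; the function names are illustrative choices, not part of the text.

```python
import math


def mauchly_u(det_S, trace_S, p):
    """u = p^p |S| / (tr S)^p, as in (7.7)."""
    return p ** p * det_S / trace_S ** p


def sphericity_u_prime(u, p, nu):
    """u' = -(nu - (2p^2 + p + 2) / (6p)) ln u, as in (7.9)."""
    return -(nu - (2 * p ** 2 + p + 2) / (6 * p)) * math.log(u)


def sphericity_df(p):
    """Degrees of freedom p(p + 1)/2 - 1 of the chi-square approximation."""
    return p * (p + 1) // 2 - 1


nu = 11 - 1  # n = 11 observations in the example

# H0: Sigma = sigma^2 I, with p = 5
u_full = mauchly_u(27_236_586, 292.891, p=5)
u_prime_full = sphericity_u_prime(u_full, p=5, nu=nu)

# H0: C Sigma C' = sigma^2 I, with p - 1 = 4 in place of p
u_contrast = mauchly_u(144_039.8, 93.6, p=4)
u_prime_contrast = sphericity_u_prime(u_contrast, p=4, nu=nu)
```

Up to rounding, `u_prime_full` and `u_prime_contrast` reproduce the values 26.177 and 6.170 reported above, with `sphericity_df(5)` = 14 and `sphericity_df(4)` = 9 degrees of freedom.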

7.2.3  Testing $H_0: \boldsymbol{\Sigma} = \sigma^2[(1 - \rho)\mathbf{I} + \rho\mathbf{J}]$

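In Section 6.9.1, it was noted that univariate ANOVA remains valid if

$$\boldsymbol{\Sigma} = \sigma^2\begin{pmatrix}
1 & \rho & \rho & \cdots & \rho \\
\rho & 1 & \rho & \cdots & \rho \\
\vdots & \vdots & \vdots & & \vdots \\
\rho & \rho & \rho & \cdots & 1
\end{pmatrix} \tag{7.10}$$

$$\phantom{\boldsymbol{\Sigma}} = \sigma^2[(1 - \rho)\mathbf{I} + \rho\mathbf{J}], \tag{7.11}$$

where $\mathbf{J}$ is a square matrix of 1's, as defined in (2.12), and $\rho$ is the population correlation
between any two variables. This pattern of equal variances and equal covariances
in $\boldsymbol{\Sigma}$ is variously referred to as uniformity, compound symmetry, or the intraclass
correlation model.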

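We now consider the hypothesis that (7.10) holds:

$$H_0: \boldsymbol{\Sigma} = \begin{pmatrix}
\sigma^2 & \sigma^2\rho & \cdots & \sigma^2\rho \\
\sigma^2\rho & \sigma^2 & \cdots & \sigma^2\rho \\
\vdots & \vdots & & \vdots \\
\sigma^2\rho & \sigma^2\rho & \cdots & \sigma^2
\end{pmatrix}.$$

From a sample, we obtain the sample covariance matrix $\mathbf{S}$. Estimates of $\sigma^2$ and $\sigma^2\rho$
under $H_0$ are given by

$$s^2 = \frac{1}{p}\sum_{j=1}^{p}s_{jj} \quad \text{and} \quad s^2r = \frac{1}{p(p-1)}\sum_{j \neq k}s_{jk}, \tag{7.12}$$

respectively, where $s_{jj}$ and $s_{jk}$ are from $\mathbf{S}$. Thus $s^2$ is an average of the variances
on the diagonal of $\mathbf{S}$, and $s^2r$ is an average of the off-diagonal covariances in $\mathbf{S}$. An
estimate of $\rho$ can be obtained as $r = s^2r/s^2$. Using $s^2$ and $s^2r$ in (7.12), the estimate
of $\boldsymbol{\Sigma}$ under $H_0$ is then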
$$\mathbf{S}_0 = \begin{pmatrix}
s^2 & s^2r & \cdots & s^2r \\
s^2r & s^2 & \cdots & s^2r \\
\vdots & \vdots & & \vdots \\
s^2r & s^2r & \cdots & s^2
\end{pmatrix} = s^2[(1 - r)\mathbf{I} + r\mathbf{J}]. \tag{7.13}$$

To compare $\mathbf{S}$ and $\mathbf{S}_0$, we use the following function of the likelihood ratio:

$$u = \frac{|\mathbf{S}|}{|\mathbf{S}_0|}, \tag{7.14}$$

which can be expressed in the alternative form

$$u = \frac{|\mathbf{S}|}{(s^2)^p(1 - r)^{p-1}[1 + (p - 1)r]}. \tag{7.15}$$

By analogy with (7.9), the test statistic is given by

$$u' = -\left[\nu - \frac{p(p+1)^2(2p-3)}{6(p-1)(p^2+p-4)}\right]\ln u, \tag{7.16}$$

where $\nu$ is the degrees of freedom of $\mathbf{S}$ (see comments at the beginning of Section
7.2). The statistic $u'$ is approximately $\chi^2[\tfrac{1}{2}p(p+1) - 2]$, and we reject $H_0$ if
$u' > \chi^2[\alpha, \tfrac{1}{2}p(p+1) - 2]$. Note that 2 degrees of freedom are lost due to estimation of
$\sigma^2$ and $\rho$.

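An alternative approximate test that is more precise when $p$ is large and $\nu$ is
relatively small is given by

$$F = -\frac{(\gamma_2 - \gamma_2c_1 - \gamma_1)\nu}{\gamma_1\gamma_2}\ln u,$$

where

$$c_1 = \frac{p(p+1)^2(2p-3)}{6\nu(p-1)(p^2+p-4)}, \qquad c_2 = \frac{p(p^2-1)(p+2)}{6\nu^2(p^2+p-4)},$$

$$\gamma_1 = \tfrac{1}{2}p(p+1) - 2, \qquad \gamma_2 = \frac{\gamma_1 + 2}{c_2 - c_1^2}.$$

We reject $H_0: \boldsymbol{\Sigma} = \sigma^2[(1 - \rho)\mathbf{I} + \rho\mathbf{J}]$ if $F > F_{\alpha,\gamma_1,\gamma_2}$.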

Example 7.2.3. To illustrate this test, we use the cork data of Table 6.21. In Problem 6.34,
a comparison is made of average thickness, and hence weight, in the four
directions. A standard ANOVA approach to this repeated measures design would be
valid if (7.10) holds. To check this assumption, we test $H_0: \boldsymbol{\Sigma} = \sigma^2[(1 - \rho)\mathbf{I} + \rho\mathbf{J}]$.
The sample covariance matrix is given by

$$\mathbf{S} = \begin{pmatrix}
290.41 & 223.75 & 288.44 & 226.27 \\
223.75 & 219.93 & 229.06 & 171.37 \\
288.44 & 229.06 & 350.00 & 259.54 \\
226.27 & 171.37 & 259.54 & 226.00
\end{pmatrix},$$

from which we obtain

$$|\mathbf{S}| = 25{,}617{,}563.28, \qquad s^2 = \frac{1}{p}\sum_{j=1}^{p}s_{jj} = 271.586,$$

$$s^2r = \frac{1}{p(p-1)}\sum_{j \neq k}s_{jk} = 233.072, \qquad r = \frac{s^2r}{s^2} = \frac{233.072}{271.586} = .858.$$

From (7.15) and (7.16), we now have

$$u = \frac{25{,}617{,}563.28}{(271.586)^4(1 - .858)^3[1 + (3)(.858)]} = .461,$$

$$u' = -\left[\nu - \frac{p(p+1)^2(2p-3)}{6(p-1)(p^2+p-4)}\right]\ln u = \left[27 - \frac{(4)(25)(5)}{(6)(3)(16)}\right](.774) = 19.511.$$

Since $19.511 > \chi^2_{.05,8} = 15.5$, we reject $H_0$ and conclude that $\boldsymbol{\Sigma}$ does not have the
pattern in (7.10). □
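
As a check on the arithmetic of Example 7.2.3, the following Python sketch evaluates (7.15) and (7.16) from the summary quantities quoted in the example ($|\mathbf{S}|$, $s^2$, $s^2r$, $p = 4$, $\nu = 27$). The function name is a hypothetical choice, and the determinant is taken as given rather than recomputed from the rounded entries of $\mathbf{S}$.

```python
import math


def compound_symmetry_test(det_S, s2, s2r, p, nu):
    """Return (r, u, u_prime, df) for H0: Sigma = sigma^2[(1 - rho)I + rho J]."""
    r = s2r / s2                                                     # estimate of rho
    u = det_S / (s2 ** p * (1 - r) ** (p - 1) * (1 + (p - 1) * r))   # (7.15)
    correction = p * (p + 1) ** 2 * (2 * p - 3) / (6 * (p - 1) * (p ** 2 + p - 4))
    u_prime = -(nu - correction) * math.log(u)                       # (7.16)
    df = p * (p + 1) // 2 - 2                                        # two parameters estimated under H0
    return r, u, u_prime, df


r_cork, u_cork, u_prime_cork, df_cork = compound_symmetry_test(
    det_S=25_617_563.28, s2=271.586, s2r=233.072, p=4, nu=27
)
```

The result agrees with the values in the example up to rounding in the intermediate steps, and `df_cork` equals 8, the degrees of freedom used for the critical value.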


7.3  TESTS  COMPARING  COVARIANCE  MATRICES 

An assumption for $T^2$ or MANOVA tests comparing two or more mean vectors is
that the corresponding population covariance matrices are equal:
$\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 = \cdots = \boldsymbol{\Sigma}_k$. Under this assumption, the sample covariance matrices
$\mathbf{S}_1, \mathbf{S}_2, \ldots, \mathbf{S}_k$ reflect a common population $\boldsymbol{\Sigma}$ and are therefore pooled to obtain an
estimate of $\boldsymbol{\Sigma}$. If $\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 = \cdots = \boldsymbol{\Sigma}_k$ is not true, large differences in
$\mathbf{S}_1, \mathbf{S}_2, \ldots, \mathbf{S}_k$ may possibly lead to rejection of $H_0: \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2 = \cdots = \boldsymbol{\mu}_k$. However,
the $T^2$ and MANOVA tests are fairly robust to heterogeneity of covariance matrices as long
as the sample sizes are large and equal. For other cases it is useful to have available a test
of equality of covariance matrices. We begin with a review of the univariate case.


7.3.1  Univariate  Tests  of  Equality  of  Variances 

The two-sample univariate hypothesis $H_0: \sigma_1^2 = \sigma_2^2$ vs. $H_1: \sigma_1^2 \neq \sigma_2^2$ is tested with

$$F = \frac{s_1^2}{s_2^2}, \tag{7.17}$$

where $s_1^2$ and $s_2^2$ are the variances of the two samples. If $H_0$ is true (and assuming
normality), $F$ is distributed as $F_{\nu_1,\nu_2}$, where $\nu_1$ and $\nu_2$ are the degrees of freedom of
$s_1^2$ and $s_2^2$ (typically, $n_1 - 1$ and $n_2 - 1$). Note that $s_1^2$ and $s_2^2$ must be independent,
which will hold if the two samples are independent.

For the several-sample case, various procedures have been proposed. We present
Bartlett's (1937) test of homogeneity of variances because it has been extended to
the multivariate case. To test

$$H_0: \sigma_1^2 = \sigma_2^2 = \cdots = \sigma_k^2,$$

we calculate

$$c = 1 + \frac{1}{3(k-1)}\left[\sum_{i=1}^{k}\frac{1}{\nu_i} - \frac{1}{\sum_{i=1}^{k}\nu_i}\right], \qquad s^2 = \frac{\sum_{i=1}^{k}\nu_is_i^2}{\sum_{i=1}^{k}\nu_i},$$

$$m = \left(\sum_{i=1}^{k}\nu_i\right)\ln s^2 - \sum_{i=1}^{k}\nu_i\ln s_i^2,$$

where $s_1^2, s_2^2, \ldots, s_k^2$ are independent sample variances with $\nu_1, \nu_2, \ldots, \nu_k$ degrees
of freedom, respectively. Then

$$\frac{m}{c} \text{ is approximately } \chi^2_{k-1}.$$

We reject $H_0$ if $m/c > \chi^2_{\alpha,k-1}$.

For an $F$-approximation, we use $c$ and $m$ and calculate in addition

$$a_1 = k - 1, \qquad a_2 = \frac{k+1}{(c-1)^2}, \qquad b = \frac{a_2}{2 - c + 2/a_2}.$$

Then

$$F = \frac{a_2m}{a_1(b - m)} \text{ is approximately } F_{a_1,a_2}.$$

We reject $H_0$ if $F > F_{\alpha}$.

Note that an assumption for either form of the preceding test is independence
of $s_1^2, s_2^2, \ldots, s_k^2$, which will hold for random samples from $k$ distinct populations.
This test would therefore be inappropriate for comparing $s_{11}, s_{22}, \ldots, s_{pp}$ from the
diagonal of $\mathbf{S}$, since the $s_{jj}$'s are correlated.
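
Both forms of Bartlett's test described above can be sketched in plain Python. The function name and the two-sample input below are hypothetical illustrations, not data from the text; the sample variances are assumed to come from independent samples.

```python
import math


def bartlett_homogeneity(variances, dfs):
    """Bartlett's test of H0: sigma_1^2 = ... = sigma_k^2 (chi-square and F forms)."""
    k = len(variances)
    total_df = sum(dfs)
    c = 1 + (sum(1 / v for v in dfs) - 1 / total_df) / (3 * (k - 1))
    pooled = sum(v * s2 for v, s2 in zip(dfs, variances)) / total_df   # pooled variance s^2
    m = total_df * math.log(pooled) - sum(
        v * math.log(s2) for v, s2 in zip(dfs, variances)
    )
    # F-approximation quantities
    a1 = k - 1
    a2 = (k + 1) / (c - 1) ** 2
    b = a2 / (2 - c + 2 / a2)
    return {
        "c": c,
        "m": m,
        "chi2": m / c,              # compare with chi-square, k - 1 df
        "chi2_df": k - 1,
        "F": a2 * m / (a1 * (b - m)),  # compare with F, (a1, a2) df
        "F_df": (a1, a2),
    }


# Hypothetical illustration: two sample variances, each with 10 degrees of freedom
result = bartlett_homogeneity(variances=[1.0, 4.0], dfs=[10, 10])

# Equal sample variances give m = 0, so neither statistic can reject H0
equal = bartlett_homogeneity(variances=[2.0, 2.0, 2.0], dfs=[5, 8, 11])
```

Critical values for `result["chi2"]` and `result["F"]` would be read from chi-square and $F$ tables with the returned degrees of freedom.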

7.3.2  Multivariate  Tests  of  Equality  of  Covariance  Matrices 

For $k$ multivariate populations, the hypothesis of equality of covariance matrices is

$$H_0: \boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 = \cdots = \boldsymbol{\Sigma}_k. \tag{7.18}$$

The test of $H_0: \boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2$ for two groups is treated as a special case by setting $k = 2$.
There is no exact test of $H_0: \boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2$ as there is in the analogous univariate case [see
(7.17)]. We assume independent samples of size $n_1, n_2, \ldots, n_k$ from multivariate
normal distributions, as in an unbalanced one-way MANOVA, for example. To make
the test, we calculate

$$M = \frac{|\mathbf{S}_1|^{\nu_1/2}|\mathbf{S}_2|^{\nu_2/2}\cdots|\mathbf{S}_k|^{\nu_k/2}}{|\mathbf{S}_{pl}|^{\sum_i\nu_i/2}}, \tag{7.19}$$

in which $\nu_i = n_i - 1$, $\mathbf{S}_i$ is the covariance matrix of the $i$th sample, and $\mathbf{S}_{pl}$ is the
pooled sample covariance matrix

$$\mathbf{S}_{pl} = \frac{\sum_{i=1}^{k}\nu_i\mathbf{S}_i}{\sum_{i=1}^{k}\nu_i} = \frac{\mathbf{E}}{\nu_E}, \tag{7.20}$$

where $\mathbf{E}$ is given by (6.33) and $\nu_E = \sum_{i=1}^{k}\nu_i = \sum_i n_i - k$. It is clear that we
must have every $\nu_i > p$; otherwise $|\mathbf{S}_i| = 0$ for some $i$, and $M$ would be zero.
Exact upper percentage points of $-2\ln M = \nu(k\ln|\mathbf{S}_{pl}| - \sum_i\ln|\mathbf{S}_i|)$ for the special
case of $\nu_1 = \nu_2 = \cdots = \nu_k = \nu$ are given in Table A.14 for $p = 2, 3, 4, 5$ and
$k = 2, 3, \ldots, 10$ (Lee, Chiang, and Krishnaiah 1977). We can easily modify (7.19)
and (7.20) to compare covariance matrices for the cells of a two-way model by using
$\nu_{ij} = n_{ij} - 1$.
The statistic $M$ is a modification of the likelihood ratio and varies between 0 and
1, with values near 1 favoring $H_0$ in (7.18) and values near 0 leading to rejection of
$H_0$. It is not immediately obvious that $M$ in (7.19) behaves in this way, and we offer
the following heuristic argument. First we note that (7.19) can be expressed as

$$M = \left(\frac{|\mathbf{S}_1|}{|\mathbf{S}_{pl}|}\right)^{\nu_1/2}\left(\frac{|\mathbf{S}_2|}{|\mathbf{S}_{pl}|}\right)^{\nu_2/2}\cdots\left(\frac{|\mathbf{S}_k|}{|\mathbf{S}_{pl}|}\right)^{\nu_k/2}. \tag{7.21}$$

If $\mathbf{S}_1 = \mathbf{S}_2 = \cdots = \mathbf{S}_k = \mathbf{S}_{pl}$, then $M = 1$. As the disparity among $\mathbf{S}_1, \mathbf{S}_2, \ldots, \mathbf{S}_k$
increases, $M$ approaches zero. To see this, note that the determinant of the pooled
covariance matrix, $|\mathbf{S}_{pl}|$, lies somewhere near the "middle" of the $|\mathbf{S}_i|$'s and that as a
set of variables $z_1, z_2, \ldots, z_n$ increases in spread, $z_{(1)}/\bar{z}$ reduces the product more
than $z_{(n)}/\bar{z}$ increases it, where $z_{(1)}$ and $z_{(n)}$ are the minimum and maximum values,
respectively. We illustrate this with the two sets of numbers, 4, 5, 6 and 1, 5, 9, which
have the same mean but different spread. If we assume $\nu_1 = \nu_2 = \nu_3 = \nu$, then for
the first set,

$$M_1 = [(.8)(1)(1.2)]^{\nu/2} = (.96)^{\nu/2},$$

and for the second set,

$$M_2 = [(.2)(1)(1.8)]^{\nu/2} = (.36)^{\nu/2}.$$

In $M_2$, the smallest value, .2, reduces the product proportionally more than the largest
value, 1.8, increases it. Another illustration is found in Problem 7.9.
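
The heuristic above is easy to reproduce numerically. The sketch below treats the numbers 4, 5, 6 and 1, 5, 9 as stand-ins for the determinants $|\mathbf{S}_i|$ and their mean as a stand-in for $|\mathbf{S}_{pl}|$, exactly as in the illustration; the function name is an illustrative choice.

```python
import math


def ratio_product(values):
    """Product of each value divided by the mean of the set."""
    mean = sum(values) / len(values)
    return math.prod(v / mean for v in values)


def heuristic_m(values, nu):
    """M-like quantity with common degrees of freedom nu: (product of ratios)^(nu/2)."""
    return ratio_product(values) ** (nu / 2)


tight = ratio_product([4, 5, 6])    # (.8)(1)(1.2)
spread = ratio_product([1, 5, 9])   # (.2)(1)(1.8)
```

For any common $\nu > 0$, `heuristic_m([1, 5, 9], nu)` is smaller than `heuristic_m([4, 5, 6], nu)`, which matches the claim that greater disparity pushes $M$ toward zero.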

Box (1949, 1950) gave $\chi^2$- and $F$-approximations for the distribution of $M$.
Either of these approximate tests is referred to as Box's $M$-test. For the
$\chi^2$-approximation, calculate

$$c_1 = \left[\sum_{i=1}^{k}\frac{1}{\nu_i} - \frac{1}{\sum_{i=1}^{k}\nu_i}\right]\left[\frac{2p^2 + 3p - 1}{6(p+1)(k-1)}\right]. \tag{7.22}$$

Then

$$u = -2(1 - c_1)\ln M \text{ is approximately } \chi^2\left[\tfrac{1}{2}(k-1)p(p+1)\right], \tag{7.23}$$

where $M$ is defined in (7.19), and

$$\ln M = \frac{1}{2}\sum_{i=1}^{k}\nu_i\ln|\mathbf{S}_i| - \frac{1}{2}\left(\sum_{i=1}^{k}\nu_i\right)\ln|\mathbf{S}_{pl}|. \tag{7.24}$$

We reject $H_0$ if $u > \chi^2_{\alpha}$. If $\nu_1 = \nu_2 = \cdots = \nu_k = \nu$, then $c_1$ simplifies to

$$c_1 = \frac{(k+1)(2p^2 + 3p - 1)}{6k\nu(p+1)}. \tag{7.25}$$

To justify the degrees of freedom of the $\chi^2$-approximation, note that the total
number of parameters estimated under $H_1$ is $k[\tfrac{1}{2}p(p+1)]$, whereas under $H_0$ we
estimate only $\boldsymbol{\Sigma}$, which has $p + \binom{p}{2} = \tfrac{1}{2}p(p+1)$ parameters. The difference is
$(k-1)[\tfrac{1}{2}p(p+1)]$. The quantity $k[\tfrac{1}{2}p(p+1)]$ arises from the assumption that all
$\boldsymbol{\Sigma}_i$, $i = 1, 2, \ldots, k$, are different. Technically, $H_1$ can be stated as $\boldsymbol{\Sigma}_i \neq \boldsymbol{\Sigma}_j$ for some
$i \neq j$. However, the most general case is all $\boldsymbol{\Sigma}_i$ different, and the distribution of $M$
is derived accordingly.

For the $F$-approximation, we use $c_1$ from (7.22) and calculate, additionally,

$$c_2 = \frac{(p-1)(p+2)}{6(k-1)}\left[\sum_{i=1}^{k}\frac{1}{\nu_i^2} - \frac{1}{\left(\sum_{i=1}^{k}\nu_i\right)^2}\right],$$

$$a_1 = \tfrac{1}{2}(k-1)p(p+1), \qquad a_2 = \frac{a_1 + 2}{|c_2 - c_1^2|},$$

$$b_1 = \frac{1 - c_1 - a_1/a_2}{a_1}, \qquad b_2 = \frac{1 - c_1 + 2/a_2}{a_2}. \tag{7.26}$$

If $c_2 > c_1^2$,

$$F = -2b_1\ln M \text{ is approximately } F_{a_1,a_2}. \tag{7.27}$$

If $c_2 < c_1^2$,

$$F = -\frac{2a_2b_2\ln M}{a_1(1 + 2b_2\ln M)} \text{ is approximately } F_{a_1,a_2}. \tag{7.28}$$

In either case, we reject $H_0$ if $F > F_{\alpha}$. If $\nu_1 = \nu_2 = \cdots = \nu_k = \nu$, then $c_1$ simplifies
as in (7.25) and $c_2$ simplifies to

$$c_2 = \frac{(p-1)(p+2)(k^2 + k + 1)}{6k^2\nu^2}. \tag{7.29}$$

Box's $M$-test is calculated routinely in many computer programs for MANOVA.
However, Olson (1974) showed that the $M$-test with equal $\nu_i$ may detect some forms
of heterogeneity that have only minor effects on the MANOVA tests. The test is also
sensitive to some forms of nonnormality. For example, it is sensitive to kurtosis for
which the MANOVA tests are rather robust. Thus the $M$-test may signal covariance
heterogeneity in some cases where it is not damaging to the MANOVA tests. Hence
we may not wish to automatically rule out standard MANOVA tests if the $M$-test
leads to rejection of $H_0$. Olson showed that the skewness and kurtosis statistics $b_{1,p}$
and $b_{2,p}$ (see Section 4.4.2) have similar shortcomings.


Example 7.3.2. We test the hypothesis $H_0: \boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2$ for the psychological data
of Table 5.1. The covariance matrices $\mathbf{S}_1$, $\mathbf{S}_2$, and $\mathbf{S}_{pl}$ were given in Example 5.4.2.
Using these, we obtain, by (7.24),

$$\begin{aligned}
\ln M &= \tfrac{1}{2}[\nu_1\ln|\mathbf{S}_1| + \nu_2\ln|\mathbf{S}_2|] - \tfrac{1}{2}(\nu_1 + \nu_2)\ln|\mathbf{S}_{pl}| \\
&= \tfrac{1}{2}[(31)\ln(7917.7) + (31)\ln(58958.1)] - \tfrac{1}{2}(31 + 31)\ln(27325.2) = -7.2803.
\end{aligned}$$

For an exact test, we compare

$$-2\ln M = 14.561$$

with 19.74, its critical value from Table A.14.

For the $\chi^2$-approximation, we compute, by (7.25) and (7.23),

$$c_1 = \frac{(2+1)[2(4^2) + 3(4) - 1]}{6(2)(31)(4+1)} = .06935,$$

$$u = -2(1 - c_1)\ln M = 13.551 < \chi^2_{.05,10} = 18.307.$$

For an approximate $F$-test, we first calculate the following:

$$c_2 = \frac{(4-1)(4+2)}{6(2-1)}\left[\frac{1}{31^2} + \frac{1}{31^2} - \frac{1}{(31+31)^2}\right] = .005463,$$

$$a_1 = \tfrac{1}{2}(2-1)(4)(4+1) = 10,$$

$$a_2 = \frac{10 + 2}{|.005463 - .06935^2|} = 18377.7,$$

$$b_1 = \frac{1 - .06935 - 10/18377.7}{10} = .0930,$$

$$b_2 = \frac{1 - .06935 + 2/18377.7}{18377.7} = 5.0646 \times 10^{-5}.$$

Since $c_2 = .005463 > c_1^2 = .00481$, we use (7.27) to obtain

$$F = -2b_1\ln M = 1.354 < F_{.05,10,\infty} = 1.83.$$

Thus all three tests accept $H_0$. □
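
The computations in Example 7.3.2 follow directly from (7.22)-(7.28) and can be sketched in plain Python. The function below takes the determinants $|\mathbf{S}_i|$ and $|\mathbf{S}_{pl}|$ as inputs (here, the values quoted in the example) rather than the matrices themselves; its name and return structure are illustrative assumptions.

```python
import math


def box_m_tests(dets, det_pooled, nus, p):
    """Box's M statistics from |S_i|, |S_pl|, and degrees of freedom nu_i."""
    k = len(dets)
    total = sum(nus)
    ln_m = 0.5 * sum(v * math.log(d) for v, d in zip(nus, dets)) \
        - 0.5 * total * math.log(det_pooled)                           # (7.24)
    c1 = (sum(1 / v for v in nus) - 1 / total) \
        * (2 * p ** 2 + 3 * p - 1) / (6 * (p + 1) * (k - 1))           # (7.22)
    chi2_stat = -2 * (1 - c1) * ln_m                                   # (7.23)
    a1 = (k - 1) * p * (p + 1) // 2        # also the chi-square degrees of freedom
    c2 = (p - 1) * (p + 2) / (6 * (k - 1)) \
        * (sum(1 / v ** 2 for v in nus) - 1 / total ** 2)              # (7.26)
    a2 = (a1 + 2) / abs(c2 - c1 ** 2)      # undefined if c2 equals c1^2 exactly
    if c2 > c1 ** 2:
        b1 = (1 - c1 - a1 / a2) / a1
        f_stat = -2 * b1 * ln_m                                        # (7.27)
    else:
        b2 = (1 - c1 + 2 / a2) / a2
        f_stat = -2 * a2 * b2 * ln_m / (a1 * (1 + 2 * b2 * ln_m))      # (7.28)
    return {"ln_M": ln_m, "c1": c1, "c2": c2, "chi2": chi2_stat,
            "a1": a1, "a2": a2, "F": f_stat}


# Determinants and degrees of freedom quoted in Example 7.3.2 (p = 4, k = 2)
box = box_m_tests(dets=[7917.7, 58958.1], det_pooled=27325.2, nus=[31, 31], p=4)
```

The entries of `box` match the example up to rounding; `-2 * box["ln_M"]` is the quantity compared with the exact critical value from Table A.14.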


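7.4  TESTS OF INDEPENDENCE

7.4.1  Independence of Two Subvectors

Suppose the observation vector is partitioned into two subvectors of interest, which
we label $\mathbf{y}$ and $\mathbf{x}$, as in Section 3.8.1, where $\mathbf{y}$ is $p \times 1$ and $\mathbf{x}$ is $q \times 1$. By (3.46), the
corresponding partitioning of the population covariance matrix is

$$\boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma}_{yy} & \boldsymbol{\Sigma}_{yx} \\ \boldsymbol{\Sigma}_{xy} & \boldsymbol{\Sigma}_{xx} \end{pmatrix},$$

with analogous partitioning of $\mathbf{S}$ and $\mathbf{R}$ as in (3.42):

$$\mathbf{S} = \begin{pmatrix} \mathbf{S}_{yy} & \mathbf{S}_{yx} \\ \mathbf{S}_{xy} & \mathbf{S}_{xx} \end{pmatrix}, \qquad \mathbf{R} = \begin{pmatrix} \mathbf{R}_{yy} & \mathbf{R}_{yx} \\ \mathbf{R}_{xy} & \mathbf{R}_{xx} \end{pmatrix}.$$

The hypothesis of independence of $\mathbf{y}$ and $\mathbf{x}$ can be expressed as

$$H_0: \boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma}_{yy} & \mathbf{O} \\ \mathbf{O} & \boldsymbol{\Sigma}_{xx} \end{pmatrix} \quad \text{or} \quad H_0: \boldsymbol{\Sigma}_{yx} = \mathbf{O}.$$

Thus independence of $\mathbf{y}$ and $\mathbf{x}$ means that every variable in $\mathbf{y}$ is independent of every
variable in $\mathbf{x}$. Note that there is no restriction on $\boldsymbol{\Sigma}_{yy}$ or $\boldsymbol{\Sigma}_{xx}$.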
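The likelihood ratio test statistic for $H_0: \boldsymbol{\Sigma}_{yx} = \mathbf{O}$ is given by

$$\Lambda = \frac{|\mathbf{S}|}{|\mathbf{S}_{yy}||\mathbf{S}_{xx}|} = \frac{|\mathbf{R}|}{|\mathbf{R}_{yy}||\mathbf{R}_{xx}|}, \tag{7.30}$$

which is distributed as $\Lambda_{p,q,n-1-q}$. We reject $H_0$ if $\Lambda \leq \Lambda_{\alpha}$. We thus have an exact
test for $H_0: \boldsymbol{\Sigma}_{yx} = \mathbf{O}$. Critical values for Wilks' $\Lambda$ are given in Table A.9 using
$\nu_H = q$ and $\nu_E = n - 1 - q$. The test statistic in (7.30) is equivalent (when $H_0$
is true) to the $\Lambda$-statistic (10.55) in Section 10.5.1 for testing the significance of the
regression of $\mathbf{y}$ on $\mathbf{x}$.

By the symmetry of

$$\frac{|\mathbf{S}|}{|\mathbf{S}_{yy}||\mathbf{S}_{xx}|} = \frac{|\mathbf{S}|}{|\mathbf{S}_{xx}||\mathbf{S}_{yy}|},$$

$\Lambda$ in (7.30) is also distributed as $\Lambda_{q,p,n-1-p}$. This is equivalent to property 3 in
Section 6.1.3.

Note that $|\mathbf{S}_{yy}||\mathbf{S}_{xx}|$ in (7.30) is an estimate of $|\boldsymbol{\Sigma}_{yy}||\boldsymbol{\Sigma}_{xx}|$, which by (2.92) is the
determinant of $\boldsymbol{\Sigma}$ when $\boldsymbol{\Sigma}_{yx} = \mathbf{O}$. Thus Wilks' $\Lambda$ compares an estimate of $\boldsymbol{\Sigma}$ without
restrictions to an estimate of $\boldsymbol{\Sigma}$ under $H_0: \boldsymbol{\Sigma}_{yx} = \mathbf{O}$. We can see intuitively that
$|\mathbf{S}| < |\mathbf{S}_{yy}||\mathbf{S}_{xx}|$ by noting from (2.94) that $|\mathbf{S}| = |\mathbf{S}_{xx}||\mathbf{S}_{yy} - \mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}|$, and since
$\mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}$ is positive definite, $|\mathbf{S}_{yy} - \mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}| < |\mathbf{S}_{yy}|$. This can be illustrated
for the case $p = q = 1$:

$$|\mathbf{S}| = \begin{vmatrix} s_y^2 & s_{yx} \\ s_{yx} & s_x^2 \end{vmatrix} = s_y^2s_x^2 - (s_{yx})^2 < s_y^2s_x^2.$$

As $s_{yx}$ increases, $|\mathbf{S}|$ decreases.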

Wilks' $\Lambda$ in (7.30) can be expressed in terms of eigenvalues:

$$\Lambda = \prod_{i=1}^{s}(1 - r_i^2), \tag{7.31}$$

where $s = \min(p, q)$ and the $r_i^2$'s are the nonzero eigenvalues of $\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}\mathbf{S}_{yy}^{-1}\mathbf{S}_{yx}$. We
could also use $\mathbf{S}_{yy}^{-1}\mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}$, since the (nonzero) eigenvalues of $\mathbf{S}_{yy}^{-1}\mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}$
are the same as those of $\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}\mathbf{S}_{yy}^{-1}\mathbf{S}_{yx}$ (these two matrices are of the form $\mathbf{A}\mathbf{B}$
and $\mathbf{B}\mathbf{A}$; see Section 2.11.5). The number of nonzero eigenvalues is $s = \min(p, q)$,
since $s$ is the rank of both $\mathbf{S}_{yy}^{-1}\mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}$ and $\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}\mathbf{S}_{yy}^{-1}\mathbf{S}_{yx}$. The eigenvalues are
designated $r_i^2$ because they are the squared canonical correlations between $\mathbf{y}$ and $\mathbf{x}$
(see Chapter 11). In the special case $p = 1$, (7.31) becomes

$$\Lambda = 1 - r_1^2 = 1 - R^2,$$

where $R^2$ is the square of the multiple correlation between $y$ and $(x_1, x_2, \ldots, x_q)$.
The other test statistics, $U^{(s)}$, $V^{(s)}$, and Roy's $\theta$, can also be defined in terms of
the $r_i^2$'s (see Section 11.4.1).


Example 7.4.1. Consider the diabetes data in Table 3.4. There is a natural partitioning
in the variables, with $y_1$ and $y_2$ of minor interest and $x_1$, $x_2$, and $x_3$ of major
interest. We test independence of the $y$'s and the $x$'s, that is, $H_0: \boldsymbol{\Sigma}_{yx} = \mathbf{O}$. From
Example 3.8.1, the partitioned covariance matrix is

$$\mathbf{S} = \begin{pmatrix} \mathbf{S}_{yy} & \mathbf{S}_{yx} \\ \mathbf{S}_{xy} & \mathbf{S}_{xx} \end{pmatrix} = \begin{pmatrix}
.0162 & .2160 & .7872 & -.2138 & 2.189 \\
.2160 & 70.56 & 26.23 & -23.96 & -20.84 \\
.7872 & 26.23 & 1106 & 396.7 & 108.4 \\
-.2138 & -23.96 & 396.7 & 2382 & 1143 \\
2.189 & -20.84 & 108.4 & 1143 & 2136
\end{pmatrix}.$$

To make the test, we compute

$$\Lambda = \frac{|\mathbf{S}|}{|\mathbf{S}_{yy}||\mathbf{S}_{xx}|} = \frac{3.108 \times 10^9}{(1.095)(3.920 \times 10^9)} = .724 < \Lambda_{.05,2,3,40} = .730.$$

Thus we reject the hypothesis of independence. Note the use of 40 in $\Lambda_{.05,2,3,40}$ in
place of $n - 1 - q = 46 - 1 - 3 = 42$. This is a conservative approach that allows
the use of a table value without interpolation. □
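
The two expressions for Wilks' $\Lambda$ in (7.30) and (7.31) can be compared numerically. The sketch below assumes NumPy; the function name and the 3 x 3 covariance matrix are hypothetical and are not the diabetes data of the example.

```python
import numpy as np


def independence_wilks(S, p):
    """Wilks' Lambda for H0: Sigma_yx = O, with y the first p variables.

    Returns Lambda from (7.30) and the s = min(p, q) squared canonical
    correlations whose product form appears in (7.31).
    """
    S = np.asarray(S, dtype=float)
    q = S.shape[0] - p
    Syy, Syx = S[:p, :p], S[:p, p:]
    Sxy, Sxx = S[p:, :p], S[p:, p:]
    wilks = np.linalg.det(S) / (np.linalg.det(Syy) * np.linalg.det(Sxx))   # (7.30)
    # Sxx^{-1} Sxy Syy^{-1} Syx, built with solve() instead of explicit inverses
    product = np.linalg.solve(Sxx, Sxy) @ np.linalg.solve(Syy, Syx)
    eigenvalues = np.sort(np.linalg.eigvals(product).real)[::-1]
    r2 = eigenvalues[: min(p, q)]           # keep the s nonzero eigenvalues
    return wilks, r2


# Hypothetical covariance matrix with p = 1 and q = 2
S_demo = np.array([[4.0, 2.0, 1.0],
                   [2.0, 9.0, 3.0],
                   [1.0, 3.0, 16.0]])
wilks_demo, r2_demo = independence_wilks(S_demo, p=1)
wilks_from_r2 = float(np.prod(1 - r2_demo))   # (7.31)
```

Because $p = 1$ here, `r2_demo[0]` is the squared multiple correlation $R^2$ and `wilks_demo` equals $1 - R^2$, in line with the special case noted after (7.31).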


7.4.2  Independence  of  Several  Subvectors 

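Let there be $k$ sets of variates so that $\mathbf{y}$ and $\boldsymbol{\Sigma}$ are partitioned as

$$\mathbf{y} = \begin{pmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \\ \vdots \\ \mathbf{y}_k \end{pmatrix} \quad \text{and} \quad \boldsymbol{\Sigma} = \begin{pmatrix}
\boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} & \cdots & \boldsymbol{\Sigma}_{1k} \\
\boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} & \cdots & \boldsymbol{\Sigma}_{2k} \\
\vdots & \vdots & & \vdots \\
\boldsymbol{\Sigma}_{k1} & \boldsymbol{\Sigma}_{k2} & \cdots & \boldsymbol{\Sigma}_{kk}
\end{pmatrix},$$

with $p_i$ variables in $\mathbf{y}_i$, where $p_1 + p_2 + \cdots + p_k = p$. Note that $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_k$
represents a partitioning of $\mathbf{y}$, not a random sample of independent vectors. The hypothesis
that the subvectors $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_k$ are mutually independent can be expressed as
$H_0: \boldsymbol{\Sigma}_{ij} = \mathbf{O}$ for all $i \neq j$, or

$$H_0: \boldsymbol{\Sigma} = \begin{pmatrix}
\boldsymbol{\Sigma}_{11} & \mathbf{O} & \cdots & \mathbf{O} \\
\mathbf{O} & \boldsymbol{\Sigma}_{22} & \cdots & \mathbf{O} \\
\vdots & \vdots & & \vdots \\
\mathbf{O} & \mathbf{O} & \cdots & \boldsymbol{\Sigma}_{kk}
\end{pmatrix}. \tag{7.32}$$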
The likelihood ratio statistic is

$$u = \frac{|\mathbf{S}|}{|\mathbf{S}_{11}||\mathbf{S}_{22}|\cdots|\mathbf{S}_{kk}|} \tag{7.33}$$

$$\phantom{u} = \frac{|\mathbf{R}|}{|\mathbf{R}_{11}||\mathbf{R}_{22}|\cdots|\mathbf{R}_{kk}|}, \tag{7.34}$$

where $\mathbf{S}$ and $\mathbf{R}$ are obtained from a random sample of $n$ observations and are partitioned
as $\boldsymbol{\Sigma}$ above, conforming to $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_k$. Note that the denominator of
(7.33) is the determinant of $\mathbf{S}$ restricted by $H_0$, that is, with $\mathbf{S}_{ij} = \mathbf{O}$ for all $i \neq j$
[see (2.92)]. The statistic $u$ does not have Wilks' $\Lambda$-distribution as it does in (7.30)
when $k = 2$, but a good $\chi^2$-approximation to its distribution is given by

$$u' = -\nu c\ln u, \tag{7.35}$$

where

$$c = 1 - \frac{1}{12f\nu}(2a_3 + 3a_2), \tag{7.36}$$

$$f = \tfrac{1}{2}a_2, \qquad a_2 = p^2 - \sum_{i=1}^{k}p_i^2, \qquad a_3 = p^3 - \sum_{i=1}^{k}p_i^3,$$

and $\nu$ is the degrees of freedom of $\mathbf{S}$ or $\mathbf{R}$ (see comments at the beginning of Section 7.2).
We reject the independence hypothesis if $u' > \chi^2_{\alpha,f}$.

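The degrees of freedom, $f = \tfrac{1}{2} a_2$, arises from the following consideration. The number of parameters in $\boldsymbol{\Sigma}$ unrestricted by the hypothesis is $\tfrac{1}{2} p(p + 1)$. Under the hypothesis (7.32), the number of parameters in each $\boldsymbol{\Sigma}_{ii}$ is $\tfrac{1}{2} p_i(p_i + 1)$, for a total of $\tfrac{1}{2} \sum_{i=1}^{k} p_i(p_i + 1)$. Since $\sum_{i=1}^{k} p_i = p$, the difference is

$$f = \tfrac{1}{2} p(p + 1) - \tfrac{1}{2} \sum_{i=1}^{k} p_i(p_i + 1) = \tfrac{1}{2}\left(p^2 + p - \sum_{i=1}^{k} p_i^2 - \sum_{i=1}^{k} p_i\right) = \tfrac{1}{2}\left(p^2 - \sum_{i=1}^{k} p_i^2\right) = \frac{a_2}{2}.$$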

Example 7.4.2. For 30 brands of Japanese Seishu wine, Siotani et al. (1963) studied the relationship between

$y_1$ = taste,
$y_2$ = odor,
Table 7.1. Seishu Measurements

   y1     y2     x1     x2     x3     x4     x5     x6      x7      x8
  1.0     .8   4.05   1.68    .85    3.0   3.97   5.00   16.90   122.0
   .1     .2   3.81   1.39    .30     .6   3.62   4.52   15.80    62.0
   .5     .0   4.20   1.63    .92   -2.3   3.48   4.46   15.80   139.0
   .7     .7   4.35   1.43    .97   -1.6   3.45   3.98   15.40   150.0
  -.1   -1.1   4.35   1.53    .87   -2.0   3.67   4.22   15.40   138.0
   .4     .5   4.05   1.84    .95   -2.5   3.61   5.00   16.78   123.0
   .2    -.3   4.20   1.61   1.09   -1.7   3.25   4.15   15.81   172.0
   .3    -.1   4.32   1.43    .93   -5.0   4.16   5.45   16.78   144.0
   .7     .4   4.21   1.74    .95   -1.5   3.40   4.25   16.62   153.0
   .5    -.1   4.17   1.72    .92   -1.2   3.62   4.31   16.70   121.0
  -.1     .1   4.45   1.78   1.19   -2.0   3.09   3.92   16.50   176.0
   .5    -.5   4.45   1.48    .86   -2.0   3.32   4.09   15.40   128.0
   .5     .8   4.25   1.53    .83   -3.0   3.48   4.54   15.55   126.0
   .6     .2   4.25   1.49    .86    2.0   3.13   3.45   15.60   128.0
   .0    -.5   4.05   1.48    .30     .0   3.67   4.52   15.38    99.0
  -.2    -.2   4.22   1.64    .90   -2.2   3.59   4.49   16.37   122.8
   .0    -.2   4.10   1.55    .85    1.8   3.02   3.62   15.31   114.0
   .2     .2   4.28   1.52    .75   -4.8   3.64   4.93   15.77   125.0
  -.1    -.2   4.32   1.54    .83   -2.0   3.17   4.62   16.60   119.0
   .6     .1   4.12   1.68    .84   -2.1   3.72   4.83   16.93   111.0
   .8     .5   4.30   1.50    .92   -1.5   2.98   3.92   15.10    68.0
   .5     .2   4.55   1.50   1.14     .9   2.60   3.45   15.70   197.0
   .4     .7   4.15   1.62    .78   -7.0   4.11   5.55   15.50   106.0
   .6    -.3   4.15   1.32    .31     .8   3.56   4.42   15.40    49.5
  -.7    -.3   4.25   1.77   1.12     .5   2.84   4.15   16.65   164.0
  -.2     .0   3.95   1.36    .25    1.0   3.67   4.52   15.98    29.5
   .3    -.1   4.35   1.42    .96   -2.5   3.40   4.12   15.30   131.0
   .1     .4   4.15   1.17   1.06   -4.5   3.89   5.00   16.79   168.2
   .4     .5   4.16   1.61    .91   -2.1   3.93   4.35   15.70   118.0
  -.6    -.3   3.85   1.32    .30     .7   3.61   4.29   15.71    48.0

and

$x_1$ = pH,              $x_5$ = direct reducing sugar,
$x_2$ = acidity 1,       $x_6$ = total sugar,
$x_3$ = acidity 2,       $x_7$ = alcohol,
$x_4$ = sake meter,      $x_8$ = formyl-nitrogen.

The data are in Table 7.1.

We test independence of the following four subsets of variables:

$$(y_1, y_2), \quad (x_1, x_2, x_3), \quad (x_4, x_5, x_6), \quad (x_7, x_8).$$
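The sample covariance matrix is

$$\mathbf{S} = \begin{pmatrix} \mathbf{S}_{11} & \mathbf{S}_{12} & \mathbf{S}_{13} & \mathbf{S}_{14} \\ \mathbf{S}_{21} & \mathbf{S}_{22} & \mathbf{S}_{23} & \mathbf{S}_{24} \\ \mathbf{S}_{31} & \mathbf{S}_{32} & \mathbf{S}_{33} & \mathbf{S}_{34} \\ \mathbf{S}_{41} & \mathbf{S}_{42} & \mathbf{S}_{43} & \mathbf{S}_{44} \end{pmatrix}$$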

.16  .10  .01  .006  .02 

.10  .19  -.01  .009  .02 


.01  -.01 

9  .( 

2 


6  -.11  -. 

3  -.03  - 

.01  .05  -.03  .( 


-.02  .04  -.01  .038  .05 

1.44  1.03  4.45  2.23  9.03 


-.04  .02  .01  -.02 

-.16  .03  .05  .04 


1 

2  -.( 

8  -.03 


5.02  -.35  -.67 

-.35  .13  .15 

-.67  .15  .26 


-.12  .05  .13  .35 

■23.11  -4.26  -3.47  6.73 


)3 

-.01 

4.45 

)4 

.038 

2.23 

)3 

.05 

9.03 

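where $\mathbf{S}_{11}$ is $2 \times 2$, $\mathbf{S}_{22}$ is $3 \times 3$, $\mathbf{S}_{33}$ is $3 \times 3$, and $\mathbf{S}_{44}$ is $2 \times 2$. We first obtain

$$u = \frac{|\mathbf{S}|}{|\mathbf{S}_{11}||\mathbf{S}_{22}||\mathbf{S}_{33}||\mathbf{S}_{44}|} = \frac{2.925 \times 10^{-7}}{(.0210)(.0000158)(.0361)(496.04)} = .01627.$$

For the $\chi^2$-approximation, we calculate

$$a_2 = p^2 - \sum_{i=1}^{4} p_i^2 = 10^2 - (2^2 + 3^2 + 3^2 + 2^2) = 74,$$

$$a_3 = p^3 - \sum_{i=1}^{4} p_i^3 = 930, \qquad f = \tfrac{1}{2} a_2 = 37,$$

$$c = 1 - \frac{1}{12 f \nu}(2a_3 + 3a_2) = 1 - \frac{2(930) + 3(74)}{12(37)(29)} = .838.$$

Then

$$u' = -\nu c \ln u = -(29)(.838) \ln(.01627) = 100.122,$$

which exceeds $\chi^2_{.001,37} = 69.35$, and we reject the hypothesis of independence of the four subsets. □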
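The steps in (7.33) through (7.36) can be sketched in a few lines of Python with NumPy. This is an illustrative sketch only, not part of the original text: the function names `chi2_constants` and `independence_statistic` are hypothetical, and the sketch assumes the variables in `S` are ordered so that each subvector occupies a contiguous diagonal block.

```python
import numpy as np


def chi2_constants(sizes, nu):
    """Return (f, c) from (7.35)-(7.36) for subvector sizes p_1, ..., p_k."""
    sizes_f = np.asarray(sizes, dtype=float)
    p = sizes_f.sum()
    a2 = p**2 - np.sum(sizes_f**2)
    a3 = p**3 - np.sum(sizes_f**3)
    f = a2 / 2.0
    c = 1.0 - (2.0 * a3 + 3.0 * a2) / (12.0 * f * nu)
    return f, c


def independence_statistic(S, sizes, nu):
    """Return (u, u_prime, f) for H0: Sigma_ij = O for all i != j.

    S     : p x p sample covariance (or correlation) matrix
    sizes : list of integer block sizes p_1, ..., p_k (contiguous blocks)
    nu    : degrees of freedom of S
    """
    S = np.asarray(S, dtype=float)
    f, c = chi2_constants(sizes, nu)
    # ln u = ln|S| - sum_i ln|S_ii|, computed on the log scale for stability
    log_u = np.linalg.slogdet(S)[1]
    start = 0
    for p_i in sizes:
        block = S[start:start + p_i, start:start + p_i]
        log_u -= np.linalg.slogdet(block)[1]
        start += p_i
    u = float(np.exp(log_u))
    u_prime = -nu * c * log_u
    return u, u_prime, f


# Constants for the partition (2, 3, 3, 2) with nu = 29, as in Example 7.4.2
f, c = chi2_constants([2, 3, 3, 2], nu=29)
```

The value `u_prime` would then be compared with an upper percentage point of the chi-square distribution with `f` degrees of freedom, taken from a table or a statistics library.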
7.4.3  Test for Independence of All Variables

If all $p_i = 1$ in the hypothesis (7.32) in Section 7.4.2, we have the special case in which all the variables are mutually independent, $H_0: \sigma_{jk} = 0$ for $j \neq k$, or

$$H_0: \boldsymbol{\Sigma} = \begin{pmatrix} \sigma_{11} & 0 & \cdots & 0 \\ 0 & \sigma_{22} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \sigma_{pp} \end{pmatrix}.$$

There is no restriction on the $\sigma_{jj}$'s. With $\sigma_{jk} = 0$ for all $j \neq k$, the corresponding $\rho_{jk}$'s are also 0, and an equivalent form of the hypothesis is $H_0: \mathbf{P}_\rho = \mathbf{I}$, where $\mathbf{P}_\rho$ is the population correlation matrix defined in (3.37).

Since all $p_i = 1$, the statistics (7.33) and (7.34) reduce to

$$u = \frac{|\mathbf{S}|}{s_{11} s_{22} \cdots s_{pp}} = |\mathbf{R}|, \tag{7.37}$$

and (7.35) becomes

$$u' = -\left[\nu - \tfrac{1}{6}(2p + 5)\right] \ln u, \tag{7.38}$$

which has an approximate $\chi^2_f$-distribution, where $\nu$ is the degrees of freedom of $\mathbf{S}$ or $\mathbf{R}$ (see a comment at the beginning of Section 7.2) and $f = \tfrac{1}{2} p(p - 1)$ is the degrees of freedom of $\chi^2_f$. We reject $H_0$ if $u' > \chi^2_{\alpha, f}$. Exact percentage points of $u'$ for selected values of $n$ and $p$ are given in Table A.15 (Mathai and Katiyar 1979). Percentage points for the limiting $\chi^2$-distribution are also given for comparison.

Note that $|\mathbf{R}|$ in (7.37) varies between 0 and 1. If the variables were uncorrelated (in the sample), we would have $\mathbf{R} = \mathbf{I}$ and $|\mathbf{R}| = 1$. On the other hand, if two or more variables were linearly related, $\mathbf{R}$ would not be full rank and we would have $|\mathbf{R}| = 0$. If the variables were highly correlated, $|\mathbf{R}|$ would be close to 0; if the correlations were all small, $|\mathbf{R}|$ would be close to 1. This can be illustrated for $p = 2$:

$$|\mathbf{R}| = \begin{vmatrix} 1 & r \\ r & 1 \end{vmatrix} = 1 - r^2.$$


Example 7.4.3. To test the hypothesis $H_0: \sigma_{jk} = 0$, $j \neq k$, for the probe word data from Table 3.5, we calculate

$$\mathbf{R} = \begin{pmatrix} 1.000 & .614 & .757 & .575 & .413 \\ .614 & 1.000 & .547 & .750 & .548 \\ .757 & .547 & 1.000 & .605 & .692 \\ .575 & .750 & .605 & 1.000 & .524 \\ .413 & .548 & .692 & .524 & 1.000 \end{pmatrix}.$$

Then by (7.37) and (7.38),

$$u = |\mathbf{R}| = .0409,$$

$$u' = -\left[n - 1 - \tfrac{1}{6}(2p + 5)\right] \ln u = 23.97.$$

The exact .01 critical value for $u'$ from Table A.15 is 23.75, and we therefore reject $H_0$. The approximate $\chi^2$ critical value for $u'$ is $\chi^2_{.01,10} = 23.21$, with which we also reject $H_0$. □
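As a rough illustration of (7.37) and (7.38), the following Python sketch computes $u = |\mathbf{R}|$ and $u'$ from a correlation matrix. The function name `all_variables_independence` is hypothetical, and the value $\nu = 10$ used for the probe word data is an assumption (it is the value that reproduces $u' = 23.97$ from $u = .0409$ with $p = 5$). Because the entries of $\mathbf{R}$ below are rounded to three decimals, the computed determinant may differ slightly from the printed value.

```python
import numpy as np


def all_variables_independence(R, nu):
    """Return (u, u_prime, f) for H0: all variables mutually independent.

    R  : p x p sample correlation matrix
    nu : degrees of freedom of R (assumed known by the caller)
    """
    R = np.asarray(R, dtype=float)
    p = R.shape[0]
    u = float(np.linalg.det(R))                       # (7.37)
    u_prime = -(nu - (2 * p + 5) / 6.0) * np.log(u)   # (7.38)
    f = p * (p - 1) // 2                              # chi-square degrees of freedom
    return u, u_prime, f


# Correlation matrix of Example 7.4.3 (entries as printed, rounded)
R = np.array([
    [1.000, 0.614, 0.757, 0.575, 0.413],
    [0.614, 1.000, 0.547, 0.750, 0.548],
    [0.757, 0.547, 1.000, 0.605, 0.692],
    [0.575, 0.750, 0.605, 1.000, 0.524],
    [0.413, 0.548, 0.692, 0.524, 1.000],
])
u, u_prime, f = all_variables_independence(R, nu=10)  # nu = 10 is an assumption
```

For $p = 2$ the same function returns $u = 1 - r^2$, which matches the illustration given just before the example.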


PROBLEMS 


7.1  Show that if $\mathbf{S} = \boldsymbol{\Sigma}_0$ in (7.1), then $u = 0$.

7.2  Verify (7.3); that is, show that $\ln|\boldsymbol{\Sigma}_0| - \ln|\mathbf{S}| = -\ln|\mathbf{S}\boldsymbol{\Sigma}_0^{-1}|$.

7.3  Verify (7.4); that is, show that $-\ln\left(\prod_{i=1}^{p} \lambda_i\right) + \sum_{i=1}^{p} \lambda_i = \sum_{i=1}^{p} (\lambda_i - \ln \lambda_i)$.

7.4  Show that the likelihood ratio for $H_0: \boldsymbol{\Sigma} = \sigma^2 \mathbf{I}$ is given by (7.5), $LR = [|\mathbf{S}| / (\operatorname{tr} \mathbf{S}/p)^p]^{n/2}$.

7.5  Show that $u = 1$ and $u' = 0$ if all the $\lambda_i$'s are equal, as noted in Section 7.2.2, where $u$ is given by (7.8) and $u'$ by (7.9).

7.6  Show that the covariance matrix in (7.10) can be written in the form $\sigma^2[(1 - \rho)\mathbf{I} + \rho\mathbf{J}]$, as given in (7.11).

7.7  Obtain (7.15) from (7.14) as follows:

(a)  Show that the $p \times p$ matrix $\mathbf{J}$ has a single nonzero eigenvalue equal to $p$ and corresponding eigenvector proportional to $\mathbf{j}$.

(b)  Show that $\mathbf{S}_0 = s^2[(1 - r)\mathbf{I} + r\mathbf{J}]$ in (7.13) can be written in the form $\mathbf{S}_0 = s^2(1 - r)\left(\mathbf{I} + \frac{r}{1 - r}\mathbf{J}\right)$.

(c)  Use Section 2.11.2 and (2.108) to obtain (7.15).

7.8  Show that $M$ in (7.19) can be expressed in the form given in (7.21).

7.9  (a)  Calculate  M  as  given  in  (7.21)  for 

s'=(;  !)•  *=0  «)• 

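Assume $\nu_1 = \nu_2 = 5$.

(b)  Calculate $M$ for

$$\mathbf{S}_1 = \begin{pmatrix} 2 & 1 \\ 1 & 4 \end{pmatrix}, \qquad \mathbf{S}_2 = \begin{pmatrix} 10 & 15 \\ 15 & 30 \end{pmatrix}.$$

Assume $\nu_1 = \nu_2 = 5$.

In (b), $\mathbf{S}_1$ and $\mathbf{S}_2$ differ more than in (a) and $M$ is accordingly much smaller. This illustrates the comments following (7.21).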
7.10  Obtain (7.31), $\Lambda = \prod_{i=1}^{s} (1 - r_i^2)$, by using (2.94) to write $|\mathbf{S}|$ in the form $|\mathbf{S}| = |\mathbf{S}_{xx}||\mathbf{S}_{yy} - \mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}|$.

7.11  Show that the forms of $u$ in (7.33) and (7.34) reduce to (7.37) when all $p_i = 1$.

7.12  Show that when all $p_i = 1$, $c$ in (7.36) reduces to $1 - (2p + 5)/6\nu$, so that (7.35) becomes (7.38).

7.13  Give a justification for the degrees of freedom $f = \tfrac{1}{2} p(p - 1)$ for the approximate $\chi^2$ test statistic $u'$ in (7.38).

7.14  In Example 5.2.2, we assumed that for the height and weight data of Table 3.1, the population covariance matrix is

$$\boldsymbol{\Sigma} = \begin{pmatrix} 20 & 100 \\ 100 & 1000 \end{pmatrix}.$$

Test this as a hypothesis using (7.2).

7.15  Test $H_0: \boldsymbol{\Sigma} = \sigma^2\mathbf{I}$ and $H_0: \mathbf{C}\boldsymbol{\Sigma}\mathbf{C}' = \sigma^2\mathbf{I}$ for the calculator speed data of Table 6.12.

7.16  Test $H_0: \boldsymbol{\Sigma} = \sigma^2\mathbf{I}$ and $H_0: \mathbf{C}\boldsymbol{\Sigma}\mathbf{C}' = \sigma^2\mathbf{I}$ for the ramus bone data of Table 3.6.

7.17  Test $H_0: \boldsymbol{\Sigma} = \sigma^2\mathbf{I}$ and $H_0: \mathbf{C}\boldsymbol{\Sigma}\mathbf{C}' = \sigma^2\mathbf{I}$ for the cork data of Table 6.21.

7.18  Test $H_0: \boldsymbol{\Sigma} = \sigma^2[(1 - \rho)\mathbf{I} + \rho\mathbf{J}]$ for the probe word data in Table 3.5. Use both $\chi^2$- and $F$-approximations.

7.19  Test $H_0: \boldsymbol{\Sigma} = \sigma^2[(1 - \rho)\mathbf{I} + \rho\mathbf{J}]$ for the calculator speed data in Table 6.12. Use both $\chi^2$- and $F$-approximations.

7.20  Test $H_0: \boldsymbol{\Sigma} = \sigma^2[(1 - \rho)\mathbf{I} + \rho\mathbf{J}]$ for the ramus bone data in Table 3.6. Use both $\chi^2$- and $F$-approximations.

7.21  Test $H_0: \boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2$ for the beetles data of Table 5.5. Use an exact critical value from Table A.14 as well as $\chi^2$- and $F$-approximations.

7.22  Test $H_0: \boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2$ for the engineer data of Table 5.6. Use an exact critical value from Table A.14 as well as $\chi^2$- and $F$-approximations.

7.23  Test $H_0: \boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2$ for the dystrophy data of Table 5.7. Use an exact critical value from Table A.14 as well as $\chi^2$- and $F$-approximations.

7.24  Test $H_0: \boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2$ for the cyclical data of Table 5.8. Use an exact critical value from Table A.14 as well as $\chi^2$- and $F$-approximations.

7.25  Test $H_0: \boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 = \boldsymbol{\Sigma}_3$ for the fish data of Table 6.17. Use an exact critical value from Table A.14 as well as $\chi^2$- and $F$-approximations.

7.26  Test $H_0: \boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 = \cdots = \boldsymbol{\Sigma}_6$ for the rootstock data in Table 6.2. Use an exact critical value from Table A.14 as well as $\chi^2$- and $F$-approximations.

7.27  Test $H_0: \boldsymbol{\Sigma}_{11} = \boldsymbol{\Sigma}_{12} = \cdots = \boldsymbol{\Sigma}_{43}$ for the snap bean data in Table 6.18. Use both $\chi^2$- and $F$-approximations.
7.28  Test independence of $(y_1, y_2)$ and $(x_1, x_2)$ for the sons data in Table 3.7.

7.29  Test independence of $(y_1, y_2, y_3)$ and $(x_1, x_2, x_3)$ for the glucose data in Table 3.8.

7.30  Test independence of $(y_1, y_2)$ and $(x_1, x_2, \ldots, x_8)$ for the Seishu data of Table 7.1.

7.31  The data in Table 7.2 relate temperature, humidity, and evaporation (courtesy of R. J. Freund). The variables are

$y_1$ = maximum daily air temperature,
$y_2$ = minimum daily air temperature,
$y_3$ = integrated area under daily air temperature curve, that is, a measure of average air temperature,
$y_4$ = maximum daily soil temperature,
$y_5$ = minimum daily soil temperature,
$y_6$ = integrated area under soil temperature curve,
$y_7$ = maximum daily relative humidity,
$y_8$ = minimum daily relative humidity,
$y_9$ = integrated area under daily humidity curve,
$y_{10}$ = total wind, measured in miles per day,
$y_{11}$ = evaporation.

Test independence of the following five groups of variables: $(y_1, y_2, y_3)$, $(y_4, y_5, y_6)$, $(y_7, y_8, y_9)$, $y_{10}$, and $y_{11}$.

7.32  Test the independence of all the variables for the calcium data of Table 3.3.

7.33  Test the independence of all the variables for the calculator speed data of Table 6.12.

7.34  Test the independence of all the variables for the ramus bone data of Table 3.6.

7.35  Test the independence of all the variables for the cork data of Table 6.21.
Table 7.2. Temperature, Humidity, and Evaporation

  y1   y2    y3   y4   y5    y6   y7   y8    y9   y10   y11
  84   65   147   85   59   151   95   40   398   273    30
  84   65   149   86   61   159   94   28   345   140    34
  79   66   142   83   64   152   94   41   368   318    33
  81   67   147   83   65   158   94   50   406   282    26
  84   68   167   88   69   180   93   46   379   311    41
  74   66   131   77   67   147   96   73   478   446     4
  73   66   131   78   69   159   96   72   462   294     5
  75   67   134   84   68   159   95   70   464   313    20
  84   68   161   89   71   195   95   63   430   455    31
  86   72   169   91   76   206   93   56   406   604    36
  88   73   176   91   76   206   94   55   393   610    43
  90   74   187   94   76   211   94   51   385   520    47
  88   72   171   94   75   211   96   54   405   663    45
  58   72   171   92   70   201   95   51   392   467    45
  81   69   154   87   68   167   95   61   448   184    11
  79   68   149   83   68   162   95   59   436   177    10
  84   69   160   87   66   173   95   42   392   173    30
  84   70   160   87   68   177   94   44   392    76    29
  84   70   168   88   70   169   95   48   396    72    23
  77   67   147   83   66   170   97   60   431   183    16
  87   67   166   92   67   196   96   44   379    76    37
  89   69   171   92   72   199   94   48   393   230    50
  89   72   180   94   72   204   95   48   394   193    36
  93   72   186   92   73   201   94   47   386   400    54
  93   74   188   93   72   206   95   47   389   339    44
  94   75   199   94   72   208   96   45   370   172    41
  93   74   193   95   73   214   95   50   396   238    45
  93   74   196   95   70   210   96   45   380   118    42
  96   75   198   95   71   207   93   40   365    93    50
  95   76   202   95   69   202   93   39   357   269    48
  84   73   173   96   69   173   94   58   418   128    17
  91   71   170   91   69   168   94   44   420   423    20
  88   72   179   89   70   189   93   50   399   415    15
  89   72   179   95   71   210   98   46   389   300    42
  91   72   182   96   73   208   95   43   384   193    44
  92   74   196   97   75   215   96   46   389   195    41
  94   75   192   96   69   198   95   36   380   215    49
  96   75   195   95   67   196   97   24   354   185    53
  93   76   198   94   75   211   93   43   364   466    53
  88   74   188   92   73   198   95   52   405   399    21
  88   74   178   90   74   197   95   61   447   232     1
  91   72   175   94   70   205   94   42   380   275    44
  92   72   190   95   71   209   96   44   379   166    44
  92   73   189   96   72   208   93   42   372   189    46
  94   75   194   95   71   208   93   43   373   164    47
  96   76   202   96   71   208   94   40   368   139    50

CHAPTER  8 


Discriminant  Analysis:  Description 
of  Group  Separation 


8.1  INTRODUCTION 

We use the term group to represent either a population or a sample from the population. There are two major objectives in separation of groups:

1.  Description of group separation, in which linear functions of the variables (discriminant functions) are used to describe or elucidate the differences between two or more groups. The goals of descriptive discriminant analysis include identifying the relative contribution of the $p$ variables to separation of the groups and finding the optimal plane on which the points can be projected to best illustrate the configuration of the groups.

2.  Prediction or allocation of observations to groups, in which linear or quadratic functions of the variables (classification functions) are employed to assign an individual sampling unit to one of the groups. The measured values in the observation vector for an individual or object are evaluated by the classification functions to find the group to which the individual most likely belongs.

For consistency we will use the term discriminant analysis only in connection with objective 1. We will refer to all aspects of objective 2 as classification analysis, which is the subject of Chapter 9. Unfortunately, there is no general agreement with regard to usage of the terms discriminant analysis and discriminant functions. Many writers, perhaps the majority, use the term discriminant analysis in connection with the second objective, prediction or allocation. The linear functions contributing to the first objective, description of group separation, are often referred to as canonical variates or discriminant coordinates. To avoid confusion, we prefer to reserve the term canonical for canonical correlation analysis in Chapter 11.

Discriminant functions are linear combinations of variables that best separate groups. They were introduced in Section 5.5 for two groups and in Sections 6.1.4 and 6.4 for several groups. In those sections, interest was centered on follow-up to Hotelling's $T^2$-tests and MANOVA tests. In this chapter, we further develop these useful multivariate tools.
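8.2  THE DISCRIMINANT FUNCTION FOR TWO GROUPS

We assume that the two populations to be compared have the same covariance matrix $\boldsymbol{\Sigma}$ but distinct mean vectors $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_2$. We work with samples $\mathbf{y}_{11}, \mathbf{y}_{12}, \ldots, \mathbf{y}_{1n_1}$ and $\mathbf{y}_{21}, \mathbf{y}_{22}, \ldots, \mathbf{y}_{2n_2}$ from the two populations. As usual, each vector $\mathbf{y}_{ij}$ consists of measurements on $p$ variables. The discriminant function is the linear combination of these $p$ variables that maximizes the distance between the two (transformed) group mean vectors. A linear combination $z = \mathbf{a}'\mathbf{y}$ transforms each observation vector to a scalar:

$$z_{1i} = \mathbf{a}'\mathbf{y}_{1i} = a_1 y_{1i1} + a_2 y_{1i2} + \cdots + a_p y_{1ip}, \qquad i = 1, 2, \ldots, n_1,$$

$$z_{2i} = \mathbf{a}'\mathbf{y}_{2i} = a_1 y_{2i1} + a_2 y_{2i2} + \cdots + a_p y_{2ip}, \qquad i = 1, 2, \ldots, n_2.$$

Hence the $n_1 + n_2$ observation vectors in the two samples,

$$\begin{matrix} \mathbf{y}_{11} & \mathbf{y}_{21} \\ \mathbf{y}_{12} & \mathbf{y}_{22} \\ \vdots & \vdots \\ \mathbf{y}_{1n_1} & \mathbf{y}_{2n_2}, \end{matrix}$$

are transformed to scalars,

$$\begin{matrix} z_{11} & z_{21} \\ z_{12} & z_{22} \\ \vdots & \vdots \\ z_{1n_1} & z_{2n_2}. \end{matrix}$$

We find the means $\bar{z}_1 = \sum_{i=1}^{n_1} z_{1i}/n_1 = \mathbf{a}'\bar{\mathbf{y}}_1$ and $\bar{z}_2 = \mathbf{a}'\bar{\mathbf{y}}_2$ by (3.54), where $\bar{\mathbf{y}}_1 = \sum_{i=1}^{n_1} \mathbf{y}_{1i}/n_1$ and $\bar{\mathbf{y}}_2 = \sum_{i=1}^{n_2} \mathbf{y}_{2i}/n_2$. We then wish to find the vector $\mathbf{a}$ that maximizes the standardized difference $(\bar{z}_1 - \bar{z}_2)/s_z$. Since $(\bar{z}_1 - \bar{z}_2)/s_z$ can be negative, we use the squared distance $(\bar{z}_1 - \bar{z}_2)^2/s_z^2$, which, by (3.54) and (3.55), can be expressed as

$$\frac{(\bar{z}_1 - \bar{z}_2)^2}{s_z^2} = \frac{[\mathbf{a}'(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)]^2}{\mathbf{a}'\mathbf{S}_{\text{pl}}\mathbf{a}}. \tag{8.1}$$

The maximum of (8.1) occurs when

$$\mathbf{a} = \mathbf{S}_{\text{pl}}^{-1}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2), \tag{8.2}$$

or when $\mathbf{a}$ is any multiple of $\mathbf{S}_{\text{pl}}^{-1}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)$. Thus the maximizing vector $\mathbf{a}$ is not unique. However, its "direction" is unique; that is, the relative values or ratios of $a_1, a_2, \ldots, a_p$ are unique, and $z = \mathbf{a}'\mathbf{y}$ projects points $\mathbf{y}$ onto the line on which $(\bar{z}_1 - \bar{z}_2)^2/s_z^2$ is maximized. Note that in order for $\mathbf{S}_{\text{pl}}^{-1}$ to exist, we must have $n_1 + n_2 - 2 > p$.

The optimum direction given by $\mathbf{a} = \mathbf{S}_{\text{pl}}^{-1}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)$ is effectively parallel to the line joining $\bar{\mathbf{y}}_1$ and $\bar{\mathbf{y}}_2$, because the squared distance $(\bar{z}_1 - \bar{z}_2)^2/s_z^2$ is equivalent to the standardized distance between $\bar{\mathbf{y}}_1$ and $\bar{\mathbf{y}}_2$. This can be seen by substituting (8.2) into (8.1) to obtain

$$\frac{(\bar{z}_1 - \bar{z}_2)^2}{s_z^2} = (\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)'\mathbf{S}_{\text{pl}}^{-1}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2) \tag{8.3}$$

for $z = \mathbf{a}'\mathbf{y}$ with $\mathbf{a} = \mathbf{S}_{\text{pl}}^{-1}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)$. Since $\mathbf{a}' = (\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)'\mathbf{S}_{\text{pl}}^{-1}$, we can write (8.3) as $(\bar{z}_1 - \bar{z}_2)^2/s_z^2 = \mathbf{a}'(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)$, and any other direction than that represented by $\mathbf{a} = \mathbf{S}_{\text{pl}}^{-1}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)$ would yield a smaller difference between $\mathbf{a}'\bar{\mathbf{y}}_1$ and $\mathbf{a}'\bar{\mathbf{y}}_2$ (see Section 5.5).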

Figure 8.1 illustrates the separation of two bivariate normal ($p = 2$) groups along the single dimension represented by the discriminant function $z = \mathbf{a}'\mathbf{y}$, where $\mathbf{a}$ is given by (8.2). In this illustration the population covariance matrices are equal. The linear combinations $z_{1i} = \mathbf{a}'\mathbf{y}_{1i} = a_1 y_{1i1} + a_2 y_{1i2}$ and $z_{2i} = \mathbf{a}'\mathbf{y}_{2i} = a_1 y_{2i1} + a_2 y_{2i2}$ project the points $\mathbf{y}_{1i}$ and $\mathbf{y}_{2i}$ onto the line of optimum separation of the two groups. Since the two variables $y_1$ and $y_2$ are bivariate normal, a linear combination $z = a_1 y_1 + a_2 y_2 = \mathbf{a}'\mathbf{y}$ is univariate normal (see property 1a in Section 4.2). We have therefore indicated this by two univariate normal densities along the line representing $z$.

The point where the line joining the points of intersection of the two ellipses intersects the discriminant function line $z$ is the point of maximum separation (minimum overlap) of points projected onto the line. If the two populations are multivariate normal with common covariance matrix $\boldsymbol{\Sigma}$, as illustrated in Figure 8.1, it can be shown that all possible group separation is expressed in a single new dimension.

Figure 8.1. Two-group discriminant analysis.

In Figure 8.2, we illustrate the optimum separation achieved by the discriminant function. Projection in another direction denoted by $z'$ gives a smaller standardized distance between the transformed means $\bar{z}'_1$ and $\bar{z}'_2$ and also a larger overlap between the projected points.

Example 8.2. Samples of steel produced at two different rolling temperatures are compared in Table 8.1 (Kramer and Jensen 1969a). The variables are $y_1$ = yield point and $y_2$ = ultimate strength.

Table 8.1. Yield Point and Ultimate Strength of Steel Produced at Two Rolling Temperatures

  Temperature 1      Temperature 2
   y1     y2          y1     y2
   33     60          35     57
   36     61          36     59
   35     64          38     59
   38     63          39     61
   40     65          41     63
                      43     65
                      41     59

From the data, we calculate

$$\bar{\mathbf{y}}_1 = \begin{pmatrix} 36.4 \\ 62.6 \end{pmatrix}, \qquad \bar{\mathbf{y}}_2 = \begin{pmatrix} 39.0 \\ 60.4 \end{pmatrix}, \qquad \mathbf{S}_{\text{pl}} = \begin{pmatrix} 7.92 & 5.68 \\ 5.68 & 6.29 \end{pmatrix}.$$

A plot of the data appears in Figure 8.3. We see that if the points were projected on either the $y_1$ or the $y_2$ axis, there would be considerable overlap. In fact, when the two groups are compared by means of a $t$-statistic for each variable separately, both $t$'s are nonsignificant:

$$t_1 = \frac{\bar{y}_{11} - \bar{y}_{21}}{\sqrt{s_{11}(1/n_1 + 1/n_2)}} = -1.58,$$

$$t_2 = \frac{\bar{y}_{12} - \bar{y}_{22}}{\sqrt{s_{22}(1/n_1 + 1/n_2)}} = 1.48.$$

Figure 8.3. Ultimate strength and yield point for steel rolled at two temperatures.

However, it is clear in Figure 8.3 that the two groups can be separated. If they are projected in an appropriate direction, as in Figure 8.1, there will be no overlap. The single dimension onto which the points would be projected is the discriminant function

$$z = \mathbf{a}'\mathbf{y} = a_1 y_1 + a_2 y_2 = -1.633 y_1 + 1.820 y_2,$$

where $\mathbf{a}$ is obtained as

$$\mathbf{a} = \mathbf{S}_{\text{pl}}^{-1}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2) = \begin{pmatrix} -1.633 \\ 1.820 \end{pmatrix}.$$

Table 8.2. Discriminant Function $z = -1.633 y_1 + 1.820 y_2$ Evaluated for Data in Table 8.1

  Temperature 1      Temperature 2
      55.29              46.56
      52.20              48.57
      59.30              45.30
      52.58              47.30
      52.95              47.68
                         48.05
                         40.40

The values of the projected points are found by calculating $z$ for each observation vector $\mathbf{y}$ in the two groups. The results are given in Table 8.2, where the separation provided by the discriminant function is clearly evident. □
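The calculation in Example 8.2 can be reproduced with a short NumPy sketch. The data are those of Table 8.1; the function name `discriminant_direction` is hypothetical and introduced only for illustration. The sketch forms the pooled covariance matrix and solves $\mathbf{S}_{\text{pl}}\mathbf{a} = \bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2$ as in (8.2), then projects each observation.

```python
import numpy as np

# Table 8.1: columns are y1 (yield point) and y2 (ultimate strength)
temp1 = np.array([[33, 60], [36, 61], [35, 64], [38, 63], [40, 65]], dtype=float)
temp2 = np.array([[35, 57], [36, 59], [38, 59], [39, 61],
                  [41, 63], [43, 65], [41, 59]], dtype=float)


def discriminant_direction(y1, y2):
    """Return (a, S_pl) with a = S_pl^{-1} (ybar1 - ybar2), as in (8.2)."""
    n1, n2 = len(y1), len(y2)
    # Pooled covariance: weighted average of the two sample covariance matrices
    s_pl = ((n1 - 1) * np.cov(y1, rowvar=False)
            + (n2 - 1) * np.cov(y2, rowvar=False)) / (n1 + n2 - 2)
    diff = y1.mean(axis=0) - y2.mean(axis=0)
    a = np.linalg.solve(s_pl, diff)  # avoids forming the inverse explicitly
    return a, s_pl


a, s_pl = discriminant_direction(temp1, temp2)
z1 = temp1 @ a  # projected points, temperature 1
z2 = temp2 @ a  # projected points, temperature 2
```

Because the coefficients are carried at full precision here, the projected values in `z1` and `z2` agree with Table 8.2 only to about two decimal places.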


8.3  RELATIONSHIP  BETWEEN  TWO-GROUP  DISCRIMINANT 
ANALYSIS  AND  MULTIPLE  REGRESSION 

The mutual connection between multiple regression and two-group discriminant analysis was introduced as a computational device in Section 5.6.2. The roles of independent and dependent variables are reversed in the two models. The dependent variables ($y$'s) of discriminant analysis become the independent variables in regression.

Let $w$ be a grouping variable (identifying groups 1 and 2) such that $\bar{w} = 0$ and define $\mathbf{b} = (b_1, b_2, \ldots, b_p)'$ as the vector of regression coefficients when $w$ is fit to the $y$'s. Then by (5.21), $\mathbf{b}$ is proportional to the discriminant function coefficient vector $\mathbf{a} = \mathbf{S}_{\text{pl}}^{-1}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)$:

$$\mathbf{b} = \frac{n_1 n_2}{(n_1 + n_2)(n_1 + n_2 - 2 + T^2)}\,\mathbf{a}, \tag{8.4}$$

where $T^2 = [n_1 n_2/(n_1 + n_2)](\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)'\mathbf{S}_{\text{pl}}^{-1}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)$ as in (5.9). From (5.20) the squared multiple correlation $R^2$ is related to $T^2$ by

$$R^2 = (\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)'\mathbf{b} = \frac{T^2}{n_1 + n_2 - 2 + T^2}. \tag{8.5}$$

The test statistic (5.29) for the hypothesis that $q$ of the $p + q$ variables are redundant for separating the groups can also be obtained in terms of regression by (5.31) as

$$F = \frac{n_1 + n_2 - p - q - 1}{q} \cdot \frac{R^2_{p+q} - R^2_p}{1 - R^2_{p+q}}, \tag{8.6}$$

where $R^2_{p+q}$ and $R^2_p$ are from regressions with $p + q$ and $p$ variables, respectively.

The link between two-group discriminant analysis and multiple regression was first noted by Fisher (1936). Flury and Riedwyl (1985) give further insights into the relationship.


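Example 8.3. In Example 5.6.2, the psychological data of Table 5.1 were used in an illustration of the regression approach to computation of $\mathbf{a}$ and $T^2$. We use the same data to obtain $\mathbf{b}$ and $R^2$ from $\mathbf{a}$ and $T^2$.

From the results of Examples 5.4.2 and 5.5, we have

$$T^2 = 97.6015, \qquad \mathbf{a} = \begin{pmatrix} .5104 \\ -.2033 \\ .4660 \\ -.3097 \end{pmatrix}.$$

To find $\mathbf{b}$ from $\mathbf{a}$ and $T^2$, we use (8.4):

$$\mathbf{b} = \frac{(32)(32)}{(32 + 32)(32 + 32 - 2 + 97.6015)}\,\mathbf{a} = \begin{pmatrix} .051 \\ -.020 \\ .047 \\ -.031 \end{pmatrix}.$$

To find $R^2$, we use (8.5):

$$R^2 = (\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)'\mathbf{b} = \begin{pmatrix} 3.625 \\ 2.000 \\ 10.531 \\ .812 \end{pmatrix}' \begin{pmatrix} .051 \\ -.020 \\ .047 \\ -.031 \end{pmatrix} = .611.$$

We can also use the relationship with $T^2$ in (8.5):

$$R^2 = \frac{T^2}{n_1 + n_2 - 2 + T^2} = \frac{97.6015}{32 + 32 - 2 + 97.6015} = .611.$$

□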
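The conversions (8.4) and (8.5) are simple enough to check numerically. The sketch below uses the values of $\mathbf{a}$, $T^2$, $n_1$, and $n_2$ quoted in Example 8.3; the function name `regression_from_discriminant` is hypothetical and chosen only for this illustration.

```python
import numpy as np


def regression_from_discriminant(a, t2, n1, n2):
    """Return (b, r2) from the discriminant vector a and Hotelling's T^2.

    b  follows (8.4): b = n1*n2 / ((n1 + n2)(n1 + n2 - 2 + T^2)) * a
    r2 follows (8.5): R^2 = T^2 / (n1 + n2 - 2 + T^2)
    """
    a = np.asarray(a, dtype=float)
    denom = n1 + n2 - 2 + t2
    b = (n1 * n2) / ((n1 + n2) * denom) * a
    r2 = t2 / denom
    return b, r2


# Values quoted in Example 8.3
a = [0.5104, -0.2033, 0.4660, -0.3097]
b, r2 = regression_from_discriminant(a, t2=97.6015, n1=32, n2=32)
```

Rounded to three decimals, `b` and `r2` agree with the values shown in the example.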
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8.4  DISCRIMINANT  ANALYSIS  FOR  SEVERAL  GROUPS 
8.4.1  Discriminant  Functions 

In  discriminant  analysis  for  several  groups,  we  are  concerned  with  finding  linear 
combinations  of  variables  that  best  separate  the  k  groups  of  multivariate  observa¬ 
tions.  Discriminant  analysis  for  several  groups  may  serve  any  one  of  various  pur¬ 
poses: 

1.  Examine  group  separation  in  a  two-dimensional  plot.  When  there  are  more 
than  two  groups,  it  requires  more  than  one  discriminant  function  to  describe 
group  separation.  If  the  points  in  the  /^-dimensional  space  are  projected  onto  a 
two-dimensional  space  represented  by  the  first  two  discriminant  functions,  we 
obtain  the  best  possible  view  of  how  the  groups  are  separated. 

2.  Find  a  subset  of  the  original  variables  that  separates  the  groups  almost  as  well 
as  the  original  set.  This  topic  was  introduced  in  Section  6.11 .2. 

3.  Rank  the  variables  in  terms  of  their  relative  contribution  to  group  separation. 
This  use  for  discriminant  functions  has  been  mentioned  in  Sections  5.5,  6.1.4, 
6.1.8,  and  6.4.  In  Section  8.5,  we  discuss  standardized  discriminant  function 
coefficients  that  provide  a  more  valid  comparison  of  the  variables. 

4.  Interpret  the  new  dimensions  represented  by  the  discriminant  functions. 

5.  Follow  up  to  fixed-effects  MANOVA. 

Purposes  3  and  4  are  closely  related.  Any  of  the  first  four  can  be  used  to  accom¬ 
plish  purpose  5.  Methods  of  achieving  these  five  goals  of  discriminant  analysis  are 
discussed  in  subsequent  sections.  In  the  present  section  we  review  discriminant  func¬ 
tions  for  the  several-group  case  and  discuss  attendant  assumptions.  For  alternative 
estimators  of  discriminant  functions  that  may  be  useful  in  the  presence  of  multi- 
collinearity  or  outliers,  see  Rencher  (1998,  Section  5.11). 

For  k  groups  (samples)  with  n,  observations  in  the  /  th  group,  we  transform  each 
observation  vector  y,2  to  obtain  Zij  =  a'y ,j,  i  =  1,  2, . . .  ,  k;  j  =  1,2,..., 
and  find  the  means  zt  =  a'y,-,  where  y;-  =  yij/ni-  As  in  the  two-group  case, 

we  seek  the  vector  a  that  maximally  separates  zi,Z2,  •••  ,  Zk-  To  express  separation 
among  zi,  Z2,  ■  ■  ■  ,  Zk,  we  extend  the  separation  criterion  (8.1)  to  the  k-group  case. 
Since  a'(yi  —  y2)  =  (yi  —  y2)'a,  we  can  express  (8.1)  in  the  form 

gi  -  z2)2  [a/(y1  -  y2)]2  a'(yt  -  y2)(yt  -  y2)'a 

s}  a'Sppi  a'Spia 


To  extend  (8.7)  to  k  groups,  we  use  the  H  matrix  from  MANOVA  in  place  of 
(yi  —  y2) (y)  —  y2)'  [see  (6.38)]  and  E  in  place  of  Spi  to  obtain 
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which  can  also  be  expressed  as 


SSH(z) 
SSE(z)  ’ 


(8.9) 


where  SSH(z)  and  SSE(z)  are  the  between  and  within  sums  of  squares  for  z  [see 
(6.42)]. 

We  can  write  (8.8)  in  the  form 


a'Ha  =  AaEa, 

a' (Ha  —  kEa)  =  0.  (8.10) 

We  examine  values  of  A.  and  a  that  are  solutions  of  (8.10)  in  a  search  for  the  value  of 
a  that  results  in  maximum  X.  The  solution  a!  =  O'  is  not  permissible  because  it  gives 
X  —  0/0  in  (8.8).  Other  solutions  are  found  from 

Ha  —  AEa  =  0,  (8.11) 


which  can  be  written  in  the  form 

(E“XH  -  AI)a  =  0.  (8.12) 

The solutions of (8.12) are the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_s$ and associated eigenvectors $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_s$ of $\mathbf{E}^{-1}\mathbf{H}$. As in previous discussions of eigenvalues, we consider them to be ranked $\lambda_1 > \lambda_2 > \cdots > \lambda_s$. The number of (nonzero) eigenvalues $s$ is the rank of $\mathbf{H}$, which can be found as the smaller of $k - 1$ or $p$. Thus the largest eigenvalue $\lambda_1$ is the maximum value of $\lambda = \mathbf{a}'\mathbf{H}\mathbf{a}/\mathbf{a}'\mathbf{E}\mathbf{a}$ in (8.8), and the coefficient vector that produces the maximum is the corresponding eigenvector $\mathbf{a}_1$. (This can be verified using calculus.) Hence the discriminant function that maximally separates the means is $z_1 = \mathbf{a}_1'\mathbf{y}$; that is, $z_1$ represents the dimension or direction that maximally separates the means.
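
The eigenvalue problem in (8.12) can be sketched numerically as follows. This is only an illustration under stated assumptions: the data are synthetic (not from the text), the helper names `between_within` and `discriminant_functions` are hypothetical, and NumPy's general eigen-solver is applied directly to $\mathbf{E}^{-1}\mathbf{H}$.

```python
import numpy as np

def between_within(groups):
    """Return the MANOVA matrices H (between) and E (within).

    `groups` is a list of (n_i x p) arrays, one array per group.
    """
    grand_mean = np.vstack(groups).mean(axis=0)
    p = groups[0].shape[1]
    H = np.zeros((p, p))
    E = np.zeros((p, p))
    for y in groups:
        d = y.mean(axis=0) - grand_mean
        H += y.shape[0] * np.outer(d, d)   # between-groups contribution
        c = y - y.mean(axis=0)
        E += c.T @ c                       # within-groups contribution
    return H, E

def discriminant_functions(groups):
    """Eigenvalues and eigenvectors of E^{-1}H, largest eigenvalue first."""
    H, E = between_within(groups)
    k, p = len(groups), groups[0].shape[1]
    s = min(k - 1, p)                      # number of nonzero eigenvalues
    vals, vecs = np.linalg.eig(np.linalg.solve(E, H))
    order = np.argsort(-vals.real)[:s]
    return vals.real[order], vecs.real[:, order], H, E

# Synthetic data (an assumption): k = 3 groups, p = 4 variables, unequal n_i.
rng = np.random.default_rng(0)
shifts = [np.zeros(4),
          np.array([1.0, 0.5, 0.0, 0.0]),
          np.array([0.0, 1.0, 1.0, 0.0])]
groups = [rng.normal(size=(n, 4)) + m for n, m in zip((12, 15, 10), shifts)]

lam, A, H, E = discriminant_functions(groups)
a1 = A[:, 0]
ratio = (a1 @ H @ a1) / (a1 @ E @ a1)      # criterion (8.8) evaluated at a_1
print("eigenvalues:", lam)
print("criterion at a_1:", ratio)
```

In this sketch the criterion evaluated at $\mathbf{a}_1$ agrees with $\lambda_1$, and no other trial vector gives a larger ratio, which is the property claimed for the first discriminant function.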

From the $s$ eigenvectors $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_s$ of $\mathbf{E}^{-1}\mathbf{H}$ corresponding to $\lambda_1, \lambda_2, \ldots, \lambda_s$, we obtain $s$ discriminant functions $z_1 = \mathbf{a}_1'\mathbf{y}$, $z_2 = \mathbf{a}_2'\mathbf{y}$, $\ldots$, $z_s = \mathbf{a}_s'\mathbf{y}$, which show the dimensions or directions of differences among $\bar{\mathbf{y}}_1, \bar{\mathbf{y}}_2, \ldots, \bar{\mathbf{y}}_k$. These discriminant functions are uncorrelated, but they are not orthogonal ($\mathbf{a}_i'\mathbf{a}_j \neq 0$ for $i \neq j$) because $\mathbf{E}^{-1}\mathbf{H}$ is not symmetric [see Rencher (1998, pp. 203-204)]. Note that the numbering $z_1, z_2, \ldots, z_s$ corresponds to the eigenvalues, not to the $k$ groups as was done earlier in this section.

The relative importance of each discriminant function $z_i$ can be assessed by considering its eigenvalue as a proportion of the total:

$$\frac{\lambda_i}{\sum_{j=1}^{s} \lambda_j}. \tag{8.13}$$


By this criterion, two or three discriminant functions will often suffice to describe the group differences. The discriminant functions associated with small eigenvalues can be neglected. A test of significance for each discriminant function is also available (see Section 8.6).

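The matrix $\mathbf{E}^{-1}\mathbf{H}$ is not symmetric. Many algorithms for computation of eigenvalues and eigenvectors accept only symmetric matrices. In Section 6.1.4, it was shown that the eigenvalues of the symmetric matrix $(\mathbf{U}^{-1})'\mathbf{H}\mathbf{U}^{-1}$ are the same as those of $\mathbf{E}^{-1}\mathbf{H}$, where $\mathbf{E} = \mathbf{U}'\mathbf{U}$ is the Cholesky factorization of $\mathbf{E}$. However, an adjustment is needed for the eigenvectors. If $\mathbf{b}$ is an eigenvector of $(\mathbf{U}^{-1})'\mathbf{H}\mathbf{U}^{-1}$, then $\mathbf{a} = \mathbf{U}^{-1}\mathbf{b}$ is an eigenvector of $\mathbf{E}^{-1}\mathbf{H}$.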
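
The Cholesky device just described can be sketched as follows. The matrices `H` and `E` here are small hypothetical examples, and the function name `discriminants_via_cholesky` is ours. Note that NumPy's `cholesky` returns the lower-triangular factor $\mathbf{L}$ with $\mathbf{E} = \mathbf{L}\mathbf{L}'$, so we take $\mathbf{U} = \mathbf{L}'$.

```python
import numpy as np

def discriminants_via_cholesky(H, E):
    """Eigen-decompose E^{-1}H using only a symmetric eigen-solver."""
    L = np.linalg.cholesky(E)          # lower triangular, E = L L'
    U = L.T                            # upper triangular, E = U'U
    U_inv = np.linalg.inv(U)
    M = U_inv.T @ H @ U_inv            # symmetric matrix (U^{-1})' H U^{-1}
    vals, B = np.linalg.eigh(M)        # eigh returns eigenvalues in ascending order
    order = np.argsort(-vals)
    vals, B = vals[order], B[:, order]
    A = U_inv @ B                      # back-transform each eigenvector: a = U^{-1} b
    return vals, A

# Hypothetical positive definite E and rank-2 H, for illustration only.
E = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
d1 = np.array([1.0, 0.0, 1.0])
d2 = np.array([0.0, 2.0, -1.0])
H = np.outer(d1, d1) + np.outer(d2, d2)

vals, A = discriminants_via_cholesky(H, E)
print("eigenvalues:", vals)
print("a_i' a_j:\n", A.T @ A)          # off-diagonal entries are generally nonzero
print("a_i' E a_j:\n", A.T @ E @ A)    # identity: the functions are uncorrelated
```

Because the columns of `B` are orthonormal, the back-transformed vectors satisfy $\mathbf{a}_i'\mathbf{E}\mathbf{a}_j = 0$ for $i \neq j$, which illustrates why the discriminant functions are uncorrelated even though the $\mathbf{a}_i$ themselves are not orthogonal.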

The preceding discussion was presented in terms of unequal sample sizes $n_1, n_2, \ldots, n_k$. In applications, this situation is common and can be handled with no difficulty. Ideally, the smallest $n_i$ should exceed the number of variables, $p$. This is not required mathematically but will lead to more stable discriminant functions.


Example  8.4.1.  The  data  in  Table  8.3  were  collected  by  G.  R.  Bryce  and  R.  M.  Barker 
(Brigham  Young  University)  as  part  of  a  preliminary  study  of  a  possible  link  between 
football  helmet  design  and  neck  injuries. 

Six  head  measurements  were  made  on  each  subject.  There  were  30  subjects  in 
each  of  three  groups:  high  school  football  players  (group  1),  college  football  players 
(group  2),  and  nonfootball  players  (group  3).  The  six  variables  are 


WDIM  =  head  width  at  widest  dimension, 
CIRCUM  =  head  circumference, 

FBEYE  =  front-to-back  measurement  at  eye  level, 
EYEHD  =  eye-to-top-of-head  measurement, 
EARHD  =  ear-to-top-of-head  measurement, 

JAW  =  jaw  width. 


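The eigenvalues of $\mathbf{E}^{-1}\mathbf{H}$ are $\lambda_1 = 1.9178$ and $\lambda_2 = .1159$. The corresponding eigenvectors are

$$\mathbf{a}_1 = \begin{pmatrix} -.948 \\ .004 \\ .006 \\ .647 \\ .504 \\ .829 \end{pmatrix}, \qquad \mathbf{a}_2 = \begin{pmatrix} -1.407 \\ .001 \\ .029 \\ -.540 \\ .384 \\ 1.529 \end{pmatrix}.$$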

The first eigenvalue accounts for a substantial proportion of the total:

$$\frac{\lambda_1}{\lambda_1 + \lambda_2} = \frac{1.9178}{1.9178 + .1159} = .943.$$


Thus  the  mean  vectors  lie  largely  in  one  dimension,  and  one  discriminant  function 
suffices  to  describe  most  of  the  separation  among  the  three  groups.  □ 
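
The proportion in (8.13) is a one-line computation. The sketch below uses the two eigenvalues reported in this example; the function name `relative_importance` is ours.

```python
def relative_importance(eigenvalues):
    """Each eigenvalue as a proportion of the total, as in (8.13)."""
    total = sum(eigenvalues)
    return [lam / total for lam in eigenvalues]

# Eigenvalues of E^{-1}H reported for the football data in Example 8.4.1.
eigenvalues = [1.9178, 0.1159]
props = relative_importance(eigenvalues)
for i, prop in enumerate(props, start=1):
    print(f"discriminant function {i}: proportion {prop:.3f}")
```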
Table 8.3. Head Measurements for Three Groups

Group   WDIM   CIRCUM   FBEYE   EYEHD   EARHD    JAW
    1   13.5     57.2    19.5    12.5    14.0   11.0
    1   15.5     58.4    21.0    12.0    16.0   12.0
    1   14.5     55.9    19.0    10.0    13.0   12.0
    1   15.5     58.4    20.0    13.5    15.0   12.0
    1   14.5     58.4    20.0    13.0    15.5   12.0
    1   14.0     61.0    21.0    12.0    14.0   13.0
    1   15.0     58.4    19.5    13.5    15.5   13.0
    1   15.0     58.4    21.0    13.0    14.0   13.0
    1   15.5     59.7    20.5    13.5    14.5   12.5
    1   15.5     59.7    20.5    13.0    15.0   13.0
    1   15.0     57.2    19.0    14.0    14.5   11.5
    1   15.5     59.7    21.0    13.0    16.0   12.5
    1   16.0     57.2    19.0    14.0    14.5   12.0
    1   15.5     62.2    21.5    14.0    16.0   12.0
    1   15.5     57.2    19.5    13.5    15.0   12.0
    1   14.0     61.0    20.0    15.0    15.0   12.0
    1   14.5     58.4    20.0    12.0    14.5   12.0
    1   15.0     56.9    19.0    13.0    14.0   12.5
    1   15.5     59.7    20.0    12.5    14.0   12.5
    1   15.0     57.2    19.5    12.0    14.0   11.0
    1   15.0     56.9    19.0    12.0    13.0   12.0
    1   15.5     56.9    19.5    14.5    14.5   13.0
    1   17.5     63.5    21.5    14.0    15.5   13.5
    1   15.5     57.2    19.0    13.0    15.5   12.5
    1   15.5     61.0    20.5    12.0    13.0   12.5
    1   15.5     61.0    21.0    14.5    15.5   12.5
    1   15.5     63.5    21.8    14.5    16.5   13.5
    1   14.5     58.4    20.5    13.0    16.0   10.5
    1   15.5     56.9    20.0    13.5    14.0   12.0
    1   16.0     61.0    20.0    12.5    14.5   12.5
    2   15.5     60.0    21.1    10.3    13.4   12.4
    2   15.4     59.7    20.0    12.8    14.5   11.3
    2   15.1     59.7    20.2    11.4    14.1   12.1
    2   14.3     56.9    18.9    11.0    13.4   11.0
    2   14.8     58.0    20.1     9.6    11.1   11.7
    2   15.2     57.5    18.5     9.9    12.8   11.4
    2   15.4     58.0    20.8    10.2    12.8   11.9
    2   16.3     58.0    20.1     8.8    13.0   12.9
    2   15.5     57.0    19.6    10.5    13.9   11.8
    2   15.0     56.5    19.6    10.4    14.5   12.0
    2   15.5     57.2    20.0    11.2    13.4   12.4
    2   15.5     56.5    19.8     9.2    12.8   12.2
    2   15.7     57.5    19.8    11.8    12.6   12.5
    2   14.4     57.0    20.4    10.2    12.7   12.3
    2   14.9     54.8    18.5    11.2    13.8   11.3
    2   16.5     59.8    20.2     9.4    14.3   12.2
    2   15.5     56.1    18.8     9.8    13.8   12.6
    2   15.3     55.0    19.0    10.1    14.2   11.6
    2   14.5     55.6    19.3    12.0    12.6   11.6
    2   15.5     56.5    20.0     9.9    13.4   11.5
    2   15.2     55.0    19.3     9.9    14.4   11.9
    2   15.3     56.5    19.3     9.1    12.8   11.7
    2   15.3     56.8    20.2     8.6    14.2   11.5
    2   15.8     55.5    19.2     8.2    13.0   12.6
    2   14.8     57.0    20.2     9.8    13.8   10.5
    2   15.2     56.9    19.1     9.6    13.0   11.2
    2   15.9     58.8    21.0     8.6    13.5   11.8
    2   15.5     57.3    20.1     9.6    14.1   12.3
    2   16.5     58.0    19.5     9.0    13.9   13.3
    2   17.3     62.6    21.5    10.3    13.8   12.8
    3   14.9     56.5    20.4     7.4    13.0   12.0
    3   15.4     57.5    19.5    10.5    13.8   11.5
    3   15.3     55.4    19.2     9.7    13.3   11.5
    3   14.6     56.0    19.8     8.5    12.0   11.5
    3   16.2     56.5    19.5    11.5    14.5   11.8
    3   14.6     58.0    19.9    13.0    13.4   11.5
    3   15.9     56.7    18.7    10.8    12.8   12.6
    3   14.7     55.8    18.7    11.1    13.9   11.2
    3   15.5     58.5    19.4    11.5    13.4   11.9
    3   16.1     60.0    20.3    10.6    13.7   12.2
    3   15.2     57.8    19.9    10.4    13.5   11.4
    3   15.1     56.0    19.4    10.0    13.1   10.9
    3   15.9     59.8    20.5    12.0    13.6   11.5
    3   16.1     57.7    19.7    10.2    13.6   11.5
    3   15.7     58.7    20.7    11.3    13.6   11.3
    3   15.3     56.9    19.6    10.5    13.5   12.1
    3   15.3     56.9    19.5     9.9    14.0   12.1
    3   15.2     58.0    20.6    11.0    15.1   11.7
    3   16.6     59.3    19.9    12.1    14.6   12.1
    3   15.5     58.2    19.7    11.7    13.8   12.1
    3   15.8     57.5    18.9    11.8    14.7   11.8
    3   16.0     57.2    19.8    10.8    13.9   12.0
    3   15.4     57.0    19.8    11.3    14.0   11.4
    3   16.0     59.2    20.8    10.4    13.8   12.2
    3   15.4     57.6    19.6    10.2    13.9   11.7
    3   15.8     60.3    20.8    12.4    13.4   12.1
    3   15.4     55.0    18.8    10.7    14.2   10.8
    3   15.5     58.4    19.8    13.1    14.5   11.7
    3   15.7     59.0    20.4    12.1    13.0   12.7
    3   17.3     61.7    20.7    11.9    13.3   13.3

8.4.2 A Measure of Association for Discriminant Functions

Measures of association between the dependent variables $y_1, y_2, \ldots, y_p$ and the independent grouping variable $i$ associated with $\boldsymbol{\mu}_i$, $i = 1, 2, \ldots, k$, were presented in Section 6.1.8. These measures attempt to answer the question, How well do the variables separate the groups? It was noted that Roy's statistic $\theta$ serves as an $R^2$-like measure of association, since it is the ratio of between to total sum of squares for the first discriminant function, $z_1 = \mathbf{a}_1'\mathbf{y}$:

$$\eta_\theta^2 = \theta = \frac{\lambda_1}{1 + \lambda_1} = \frac{\mathrm{SSH}(z_1)}{\mathrm{SSE}(z_1) + \mathrm{SSH}(z_1)}$$


[see (6.42) and (8.9)]. Another interpretation of $\eta_\theta^2$ is the maximum squared correlation between the first discriminant function and the best linear combination of the $k - 1$ (dummy) group membership variables [see a comment following (6.40) in Section 6.1.8]. Dummy variables were defined in the first two paragraphs of Section 6.1.8. The maximum correlation is called the (first) canonical correlation (see Chapter 11). The squared canonical correlation can be calculated for each discriminant function:

$$r_i^2 = \frac{\lambda_i}{1 + \lambda_i}, \qquad i = 1, 2, \ldots, s. \tag{8.14}$$

The  average  squared  canonical  correlation  was  used  as  a  measure  of  association  in 
(6.49). 

Example 8.4.2. For the football data of Table 8.3, we obtain the squared canonical correlation between each of the two discriminant functions and the grouping variables:

$$r_1^2 = \frac{\lambda_1}{1 + \lambda_1} = \frac{1.9178}{1 + 1.9178} = .657,$$

$$r_2^2 = \frac{\lambda_2}{1 + \lambda_2} = \frac{.1159}{1 + .1159} = .104.$$



□ 
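
The calculation in (8.14) can be sketched as follows, using the eigenvalues from Example 8.4.1. The function name `squared_canonical_correlations` is ours.

```python
def squared_canonical_correlations(eigenvalues):
    """r_i^2 = lambda_i / (1 + lambda_i) for each discriminant function, as in (8.14)."""
    return [lam / (1.0 + lam) for lam in eigenvalues]

# Eigenvalues of E^{-1}H for the football data (Example 8.4.1).
r2 = squared_canonical_correlations([1.9178, 0.1159])
for i, value in enumerate(r2, start=1):
    print(f"r_{i}^2 = {value:.3f}")
```

The first value is also Roy's $\theta$, since $\theta = \lambda_1/(1 + \lambda_1)$.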


8.5  STANDARDIZED  DISCRIMINANT  FUNCTIONS 

In Section 5.5, it was noted that in the two-group case the relative contribution of the $y$'s to separation of the two groups can best be assessed by comparing the coefficients $a_r$, $r = 1, 2, \ldots, p$, in the discriminant function

$$z = \mathbf{a}'\mathbf{y} = a_1 y_1 + a_2 y_2 + \cdots + a_p y_p.$$

Similar comments appeared in Sections 6.1.4, 6.1.8, and 6.4 concerning the use of discriminant functions to assess contribution of the $y$'s to separation of several groups. However, such comparisons are informative only if the $y$'s are commensurate, that is, measured on the same scale and with comparable variances. If the $y$'s are not commensurate, we need coefficients $a_r^*$ that are applicable to standardized variables.

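Consider the case of two groups. For the $i$th observation vector $\mathbf{y}_{1i}$ or $\mathbf{y}_{2i}$ in group 1 or 2, we can express the discriminant function in terms of standardized variables as

$$z_{1i} = a_1^* \frac{y_{1i1} - \bar{y}_{11}}{s_1} + a_2^* \frac{y_{1i2} - \bar{y}_{12}}{s_2} + \cdots + a_p^* \frac{y_{1ip} - \bar{y}_{1p}}{s_p}, \qquad i = 1, 2, \ldots, n_1,$$

$$z_{2i} = a_1^* \frac{y_{2i1} - \bar{y}_{21}}{s_1} + a_2^* \frac{y_{2i2} - \bar{y}_{22}}{s_2} + \cdots + a_p^* \frac{y_{2ip} - \bar{y}_{2p}}{s_p}, \qquad i = 1, 2, \ldots, n_2, \tag{8.15}$$
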


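where $\bar{\mathbf{y}}_1' = (\bar{y}_{11}, \bar{y}_{12}, \ldots, \bar{y}_{1p})$ and $\bar{\mathbf{y}}_2' = (\bar{y}_{21}, \bar{y}_{22}, \ldots, \bar{y}_{2p})$ are the mean vectors for the two groups, and $s_r$ is the within-sample standard deviation of the $r$th variable, obtained as the square root of the $r$th diagonal element of $\mathbf{S}_{\mathrm{pl}}$. Clearly, these standardized coefficients must be of the form

$$a_r^* = s_r a_r, \qquad r = 1, 2, \ldots, p. \tag{8.16}$$

In vector form, this becomes

$$\mathbf{a}^* = (\operatorname{diag} \mathbf{S}_{\mathrm{pl}})^{1/2}\mathbf{a}. \tag{8.17}$$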

For the several-group case, we can standardize the discriminant functions in an analogous fashion. If we denote the $r$th coefficient in the $m$th discriminant function by $a_{mr}$, $m = 1, 2, \ldots, s$; $r = 1, 2, \ldots, p$, then the standardized form is

$$a_{mr}^* = s_r a_{mr},$$

where $s_r$ is the within-group standard deviation obtained from the diagonal of $\mathbf{S}_{\mathrm{pl}} = \mathbf{E}/\nu_E$. Note that $a_{mr}^*$ has two subscripts because there are several discriminant functions, whereas $a_r^*$ in (8.16) has only one subscript because there is one discriminant function for two groups.

Alternatively, since the $m$th eigenvector is unique only up to multiplication by a scalar, we can simplify the standardization by using

$$a_{mr}^* = \sqrt{e_{rr}}\, a_{mr}, \qquad r = 1, 2, \ldots, p,$$

where $e_{rr}$ is the $r$th diagonal element of $\mathbf{E}$. For further discussion of the use of standardized discriminant function coefficients to gauge the relative contribution of the variables to group separation, see Section 8.7.1 [see also Rencher and Scott (1990) and Rencher (1998, Section 5.4)].
Example 8.5. In Example 8.4.1, we obtained the discriminant function coefficient vectors $\mathbf{a}_1$ and $\mathbf{a}_2$ for the football data of Table 8.3. Since $\lambda_1/(\lambda_1 + \lambda_2) = .94$, we concentrate on $\mathbf{a}_1$. To standardize $\mathbf{a}_1$, we need the within-sample standard deviations of the variables. The pooled covariance matrix is given by


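$$\mathbf{S}_{\mathrm{pl}} = \begin{pmatrix}
.428 & .578 & .158 & .084 & .125 & .228 \\
.578 & 3.161 & 1.020 & .653 & .340 & .505 \\
.158 & 1.020 & .546 & .077 & .129 & .159 \\
.084 & .653 & .077 & 1.232 & .315 & .024 \\
.125 & .340 & .129 & .315 & .618 & .009 \\
.228 & .505 & .159 & .024 & .009 & .376
\end{pmatrix}.$$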

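Using the square roots of the diagonal elements of $\mathbf{S}_{\mathrm{pl}}$, we obtain

$$\mathbf{a}_1^* = \begin{pmatrix} \sqrt{.428}\,(-.948) \\ \sqrt{3.161}\,(.004) \\ \vdots \\ \sqrt{.376}\,(.829) \end{pmatrix} = \begin{pmatrix} -.621 \\ .007 \\ .005 \\ .719 \\ .397 \\ .508 \end{pmatrix}.$$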

Thus  the  fourth,  first,  sixth,  and  fifth  variables  contribute  most  to  separating  the 
groups,  in  that  order.  The  second  and  third  variables  are  not  useful  (in  the  presence 
of  the  others)  in  distinguishing  groups.  □ 
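
The standardization in (8.16) and the resulting ranking can be sketched with the numbers printed in this example. Because the printed $\mathbf{a}_1$ and $\mathbf{S}_{\mathrm{pl}}$ are rounded to three decimals, the recomputed values may differ from those above in the third decimal place; the function name `standardize` is ours.

```python
import math

names = ["WDIM", "CIRCUM", "FBEYE", "EYEHD", "EARHD", "JAW"]
# Diagonal of S_pl and the first eigenvector a_1, as printed in Examples 8.5 and 8.4.1.
s_pl_diag = [0.428, 3.161, 0.546, 1.232, 0.618, 0.376]
a1 = [-0.948, 0.004, 0.006, 0.647, 0.504, 0.829]

def standardize(coefs, variances):
    """Standardized coefficients a_r* = s_r a_r, with s_r the square root of the variance."""
    return [math.sqrt(v) * a for a, v in zip(coefs, variances)]

a1_star = standardize(a1, s_pl_diag)

# Rank the variables by the absolute value of the standardized coefficient.
ranking = sorted(names, key=lambda n: -abs(a1_star[names.index(n)]))

for name, coef in zip(names, a1_star):
    print(f"{name:7s} {coef: .3f}")
print("ranking:", ranking)
```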


8.6  TESTS  OF  SIGNIFICANCE 

In  order  to  test  hypotheses,  we  need  the  assumption  of  multivariate  normality.  This 
was  not  explicitly  required  for  the  development  of  discriminant  functions. 

8.6.1  Tests  for  the  Two-Group  Case 

By (8.3) we see that the separation of transformed means, $(\bar{z}_1 - \bar{z}_2)^2/s_z^2$, achieved by the discriminant function $z = \mathbf{a}'\mathbf{y}$ is equivalent to the standardized distance between the mean vectors $\bar{\mathbf{y}}_1$ and $\bar{\mathbf{y}}_2$. This standardized distance is proportional to the two-group $T^2$ in (5.9) in Section 5.4.2. Hence the discriminant function coefficient vector $\mathbf{a}$ is significantly different from $\mathbf{0}$ if $T^2$ is significant. More formally, if the population discriminant function coefficient vector is expressed as $\boldsymbol{\alpha} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$, then $H_0\colon \boldsymbol{\alpha} = \mathbf{0}$ is equivalent to $H_0\colon \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2$.

To test the significance of a subset of the discriminant function coefficients, we can use the test of the corresponding subset of $y$'s given in Section 5.9. To test the hypothesis that the population discriminant function has a specified form $\mathbf{a}_0'\mathbf{y}$, see Rencher (1998, Section 5.5.1).
8.6.2 Tests for the Several-Group Case

In Section 8.4.1 we noted that the discriminant criterion $\lambda = \mathbf{a}'\mathbf{H}\mathbf{a}/\mathbf{a}'\mathbf{E}\mathbf{a}$ is maximized by $\lambda_1$, the largest eigenvalue of $\mathbf{E}^{-1}\mathbf{H}$, and that the remaining eigenvalues $\lambda_2, \ldots, \lambda_s$ correspond to other discriminant dimensions. These eigenvalues are the same as those in the Wilks $\Lambda$-test in (6.14) for significant differences among mean vectors,

$$\Lambda_1 = \prod_{i=1}^{s} \frac{1}{1 + \lambda_i}, \tag{8.18}$$


which is distributed as $\Lambda_{p,k-1,N-k}$, where $N = \sum_i n_i$ for an unbalanced design or $N = kn$ in the balanced case. Since $\Lambda_1$ is small if one or more $\lambda_i$'s are large, Wilks' $\Lambda$ tests for significance of the eigenvalues and thereby for the discriminant functions. The $s$ eigenvalues represent $s$ dimensions of separation of the mean vectors $\bar{\mathbf{y}}_1, \bar{\mathbf{y}}_2, \ldots, \bar{\mathbf{y}}_k$. We are interested in which, if any, of these dimensions are significant. In the context of discriminant functions, Wilks' $\Lambda$ is more useful than the other three MANOVA test statistics, because it can be used on a subset of eigenvalues, as we see shortly.

In addition to the exact test provided by the critical values for $\Lambda$ found in Table A.9, we can use the $\chi^2$-approximation for $\Lambda_1$ given in (6.16), with $\nu_E = N - k = \sum_i n_i - k$ and $\nu_H = k - 1$:

$$V_1 = -\left[\nu_E - \tfrac{1}{2}(p - \nu_H + 1)\right] \ln \Lambda_1$$

$$= -\left[N - 1 - \tfrac{1}{2}(p + k)\right] \ln \prod_{i=1}^{s} \frac{1}{1 + \lambda_i}$$

$$= \left[N - 1 - \tfrac{1}{2}(p + k)\right] \sum_{i=1}^{s} \ln(1 + \lambda_i), \tag{8.19}$$

which is approximately $\chi^2$ with $p(k - 1)$ degrees of freedom. The test statistic $\Lambda_1$ and its approximation (8.19) test the significance of all of $\lambda_1, \lambda_2, \ldots, \lambda_s$. If the test leads to rejection of $H_0$, we conclude that at least one of the $\lambda$'s is significantly different from zero, and therefore there is at least one dimension of separation of mean vectors. Since $\lambda_1$ is the largest, we are only sure of its significance, along with that of $z_1 = \mathbf{a}_1'\mathbf{y}$.

To test the significance of $\lambda_2, \lambda_3, \ldots, \lambda_s$, we delete $\lambda_1$ from Wilks' $\Lambda$ and the associated $\chi^2$-approximation to obtain

$$\Lambda_2 = \prod_{i=2}^{s} \frac{1}{1 + \lambda_i}, \tag{8.20}$$

$$V_2 = -\left[N - 1 - \tfrac{1}{2}(p + k)\right] \ln \Lambda_2 = \left[N - 1 - \tfrac{1}{2}(p + k)\right] \sum_{i=2}^{s} \ln(1 + \lambda_i), \tag{8.21}$$

which is approximately $\chi^2$ with $(p - 1)(k - 2)$ degrees of freedom. If this test leads to rejection of $H_0$, we conclude that at least $\lambda_2$ is significant along with the associated discriminant function $z_2 = \mathbf{a}_2'\mathbf{y}$. We can continue in this fashion, testing each $\lambda_i$ in turn until a test fails to reject $H_0$. (To compensate for making several tests, an adjustment to the $\alpha$-level of each test could be made as in procedure 2, Section 5.5.) The test statistic at the $m$th step is

$$\Lambda_m = \prod_{i=m}^{s} \frac{1}{1 + \lambda_i}, \tag{8.22}$$

which is distributed as $\Lambda_{p-m+1,\,k-m,\,N-k-m+1}$. The statistic

$$V_m = -\left[N - 1 - \tfrac{1}{2}(p + k)\right] \ln \Lambda_m = \left[N - 1 - \tfrac{1}{2}(p + k)\right] \sum_{i=m}^{s} \ln(1 + \lambda_i) \tag{8.23}$$

has an approximate $\chi^2$-distribution with $(p - m + 1)(k - m)$ degrees of freedom. In some cases, more $\lambda$'s will be statistically significant than the researcher considers to be of practical importance. If $\lambda_i / \sum_j \lambda_j$ is small, the associated discriminant function may not be of interest, even if it is significant.
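
The sequence of $\chi^2$-approximations in (8.23) can be sketched as follows, using the eigenvalues reported in Example 8.4.1 with $N = 90$, $p = 6$, and $k = 3$. The function name `sequential_chi2` is ours, and no $p$-values are computed here, so that only the standard library is needed; each $V_m$ would be compared with a tabled $\chi^2$ critical value.

```python
import math

def sequential_chi2(eigenvalues, N, p, k):
    """Return (m, V_m, df_m) for m = 1, ..., s, following (8.23)."""
    w = N - 1 - 0.5 * (p + k)
    results = []
    for m in range(1, len(eigenvalues) + 1):
        # Sum ln(1 + lambda_i) over i = m, ..., s (the eigenvalues are ranked).
        V = w * sum(math.log(1.0 + lam) for lam in eigenvalues[m - 1:])
        df = (p - m + 1) * (k - m)
        results.append((m, V, df))
    return results

results = sequential_chi2([1.9178, 0.1159], N=90, p=6, k=3)
for m, V, df in results:
    print(f"m = {m}: V = {V:.2f} on {df} degrees of freedom")
```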

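We can also use an $F$-approximation for each $\Lambda_i$. For

$$\Lambda_1 = \prod_{i=1}^{s} \frac{1}{1 + \lambda_i},$$

we use (6.15), with $\nu_E = N - k$ and $\nu_H = k - 1$:

$$F = \frac{1 - \Lambda_1^{1/t}}{\Lambda_1^{1/t}} \, \frac{\mathrm{df}_2}{\mathrm{df}_1}, \tag{8.24}$$

where

$$t = \sqrt{\frac{p^2(k - 1)^2 - 4}{p^2 + (k - 1)^2 - 5}}, \qquad w = N - 1 - \tfrac{1}{2}(p + k),$$

$$\mathrm{df}_1 = p(k - 1), \qquad \mathrm{df}_2 = wt - \tfrac{1}{2}[p(k - 1) - 2].$$

For

$$\Lambda_m = \prod_{i=m}^{s} \frac{1}{1 + \lambda_i}, \qquad m = 2, 3, \ldots, s,$$

we use

$$F = \frac{1 - \Lambda_m^{1/t}}{\Lambda_m^{1/t}} \, \frac{\mathrm{df}_2}{\mathrm{df}_1} \tag{8.25}$$

with $p - m + 1$ and $k - m$ in place of $p$ and $k - 1$:

$$t = \sqrt{\frac{(p - m + 1)^2(k - m)^2 - 4}{(p - m + 1)^2 + (k - m)^2 - 5}}, \qquad w = N - 1 - \tfrac{1}{2}(p + k),$$

$$\mathrm{df}_1 = (p - m + 1)(k - m), \qquad \mathrm{df}_2 = wt - \tfrac{1}{2}[(p - m + 1)(k - m) - 2].$$
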


Example 8.6.2. We test the significance of the two discriminant functions obtained in Example 8.4.1 for the football data. For the overall test we have, by (8.18),

$$\Lambda_1 = \prod_{i=1}^{2} \frac{1}{1 + \lambda_i} = \frac{1}{1 + 1.9178} \cdot \frac{1}{1 + .1159} = .307.$$

With $p = 6$, $k = 3$, and $N - k = 87$, the critical value from Table A.9 is $\Lambda_{.05,6,2,80} = .762$. By (8.19), the $\chi^2$-approximation is

$$V_1 = -\left[N - 1 - \tfrac{1}{2}(p + k)\right] \ln \Lambda_1 = -\left[90 - 1 - \tfrac{1}{2}(6 + 3)\right] \ln(.307) = 99.75,$$

which exceeds the critical value $\chi^2_{.01,12} = 26.217$. Thus at least the first discriminant function is significant.

To test the second discriminant function, we have, by (8.20),

$$\Lambda_2 = \frac{1}{1 + .1159} = .896.$$

With $m = 2$, the (conservative) critical value is $\Lambda_{.05,5,1,80} = .867$. Since this is close to $\Lambda_2 = .896$, we interpolate in Table A.9 to obtain $\Lambda_{.05,5,1,86} = .875$. By (8.21), the $\chi^2$-approximation is

$$V_2 = -\left[N - 1 - \tfrac{1}{2}(p + k)\right] \ln \Lambda_2 = -\left[90 - 1 - \tfrac{1}{2}(6 + 3)\right] \ln \frac{1}{1 + .1159} = 9.27 < \chi^2_{.05,5} = 11.070.$$


For the $F$-approximation for $\Lambda_1$, we obtain by (8.24)

$$t = \sqrt{\frac{p^2(k - 1)^2 - 4}{p^2 + (k - 1)^2 - 5}} = \sqrt{\frac{6^2 2^2 - 4}{6^2 + 2^2 - 5}} = 2,$$

$$w = N - 1 - \tfrac{1}{2}(p + k) = 90 - 1 - \tfrac{1}{2}(6 + 3) = 84.5,$$

$$\mathrm{df}_1 = p(k - 1) = 6(2) = 12,$$

$$\mathrm{df}_2 = wt - \tfrac{1}{2}[p(k - 1) - 2] = (84.5)(2) - \tfrac{1}{2}[6(2) - 2] = 164,$$

$$F = \frac{1 - \Lambda_1^{1/2}}{\Lambda_1^{1/2}} \, \frac{\mathrm{df}_2}{\mathrm{df}_1} = \frac{1 - .307^{1/2}}{.307^{1/2}} \, \frac{164}{12} = 10.994.$$

The $p$-value for $F = 10.994$ is less than .0001. For the $F$-approximation for $\Lambda_2$, we reduce $p$ and $k$ by 1 and obtain by (8.25)

$$t = \sqrt{\frac{5^2 1^2 - 4}{5^2 + 1^2 - 5}} = 1, \qquad w = 90 - 1 - \tfrac{1}{2}(6 + 3) = 84.5,$$

$$\mathrm{df}_1 = 5(1) = 5, \qquad \mathrm{df}_2 = 84.5(1) - \tfrac{1}{2}[5(1) - 2] = 83,$$

$$F = \frac{1 - \Lambda_2}{\Lambda_2} \, \frac{\mathrm{df}_2}{\mathrm{df}_1} = \frac{1 - .896}{.896} \, \frac{83}{5} = 1.924.$$

The $p$-value for $F = 1.924$ is .099. Thus only the first discriminant function significantly separates groups. The exact test using $\Lambda_2$ appears to be somewhat closer to rejection than are the approximate tests. □
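
The $F$-approximations in (8.24) and (8.25) can be collected in one small function, sketched below with the football eigenvalues. The function name `wilks_F` is ours. Setting $t = 1$ when the denominator under the square root is not positive is the usual convention for this approximation and is an assumption here, since the text does not discuss that case. The $p$-values are not computed.

```python
import math

def wilks_F(eigenvalues, N, p, k, m=1):
    """F-approximation for Lambda_m, following (8.24) for m = 1 and (8.25) for m > 1."""
    lam_m = 1.0
    for lam in eigenvalues[m - 1:]:
        lam_m /= (1.0 + lam)               # Lambda_m = product of 1 / (1 + lambda_i), i >= m
    pp, kk = p - m + 1, k - m              # replace p and k - 1 by p - m + 1 and k - m
    denom = pp**2 + kk**2 - 5
    # Assumed convention: t = 1 when the denominator is zero or negative.
    t = math.sqrt((pp**2 * kk**2 - 4) / denom) if denom > 0 else 1.0
    w = N - 1 - 0.5 * (p + k)
    df1 = pp * kk
    df2 = w * t - 0.5 * (pp * kk - 2)
    root = lam_m ** (1.0 / t)
    F = (1.0 - root) / root * df2 / df1
    return {"Lambda": lam_m, "t": t, "df1": df1, "df2": df2, "F": F}

eigenvalues = [1.9178, 0.1159]
test1 = wilks_F(eigenvalues, N=90, p=6, k=3, m=1)
test2 = wilks_F(eigenvalues, N=90, p=6, k=3, m=2)
for label, res in (("Lambda_1", test1), ("Lambda_2", test2)):
    print(label, {key: round(val, 3) for key, val in res.items()})
```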


8.7  INTERPRETATION  OF  DISCRIMINANT  FUNCTIONS 

There  is  a  close  correspondence  between  interpreting  discriminant  functions  and 
determining  the  contribution  of  each  variable,  and  we  shall  not  always  make  a  dis¬ 
tinction.  In  interpretation,  the  signs  of  the  coefficients  are  taken  into  account;  in 
ascertaining  the  contribution,  the  signs  are  ignored,  and  the  coefficients  are  ranked 
in  absolute  value.  (We  discuss  this  distinction  further  in  Section  8.7.1.)  We  are  more 
commonly  interested  in  assessing  the  contribution  of  the  variables  than  in  interpret¬ 
ing  the  function. 

In the next three sections, we cover three common approaches to assessing the contribution of each variable (in the presence of the other variables) to separating the groups. The three methods are (1) examine the standardized discriminant function coefficients, (2) calculate a partial $F$-test for each variable, and (3) calculate a correlation between each variable and the discriminant function. The third method is the most widely recommended, but we note in Section 8.7.3 that it is the least useful.


8.7.1  Standardized  Coefficients 

To  offset  differing  scales  among  the  variables,  the  discriminant  function  coefficients 
can  be  standardized  using  (8.16)  or  (8.17),  in  which  the  coefficients  have  been 
adjusted  so  that  they  apply  to  standardized  variables.  For  the  observations  in  the  first 
of  two  groups,  for  example,  we  have  by  (8.15), 


$$z_{1i} = a_1^* \frac{y_{1i1} - \bar{y}_{11}}{s_1} + a_2^* \frac{y_{1i2} - \bar{y}_{12}}{s_2} + \cdots + a_p^* \frac{y_{1ip} - \bar{y}_{1p}}{s_p}, \qquad i = 1, 2, \ldots, n_1.$$

The standardized variables $(y_{1ir} - \bar{y}_{1r})/s_r$ are scale free, and the standardized coefficients $a_r^* = s_r a_r$, $r = 1, 2, \ldots, p$, therefore correctly reflect the joint contribution of the variables to the discriminant function $z$ as it maximally separates the groups. For the case of several groups, each discriminant function coefficient vector $\mathbf{a} = (a_1, a_2, \ldots, a_p)'$ is an eigenvector of $\mathbf{E}^{-1}\mathbf{H}$, and as such, it takes into account the sample correlations among the variables as well as the influence of each variable in the presence of the others.

As  noted  in  Section  8.5,  this  standardization  is  carried  out  for  each  of  the  s  dis¬ 
criminant  functions.  Typically,  each  will  have  a  different  interpretation;  that  is,  the 
pattern  of  the  coefficients  a*  will  vary  from  one  function  to  another. 

The absolute values of the coefficients can be used to rank the variables in order of their contribution to separating the groups. If we wish to go further and interpret or "name" a discriminant function, the signs can be taken into account. Thus, for example, $z_1 = .8y_1 - .9y_2 + .5y_3$ has a different meaning than does $z_2 = .8y_1 + .9y_2 + .5y_3$, since $z_1$ depends on the difference between $y_1$ and $y_2$, whereas $z_2$ is related to the sum of $y_1$ and $y_2$.

The  discriminant  function  is  subject  to  the  same  limitations  as  other  linear  combi¬ 
nations  such  as  a  regression  equation:  for  example,  (1)  the  coefficient  for  a  variable 
may  change  notably  if  variables  are  added  or  deleted,  and  (2)  the  coefficients  may 
not  be  stable  from  sample  to  sample  unless  the  sample  size  is  large  relative  to  the 
number  of  variables.  With  regard  to  limitation  1,  we  note  that  the  coefficients  reflect 
the  contribution  of  each  variable  in  the  presence  of  the  particular  variables  at  hand. 
This  is,  in  fact,  what  we  want  the  coefficients  to  do.  As  to  limitation  2,  the  process¬ 
ing  of  a  substantial  number  of  variables  is  not  “free.”  More  stable  estimates  will  be 
obtained  from  50  observations  on  2  variables  than  from  50  observations  on  20  vari¬ 
ables.  In  other  words,  if  N/p  is  too  small,  the  variables  that  rank  high  in  one  sample 
may  emerge  as  less  important  in  another  sample. 
8.7.2 Partial F-Values

For any variable $y_r$, we can calculate a partial $F$-test showing the significance of $y_r$ after adjusting for the other variables, that is, the separation provided by $y_r$ in addition to that due to the other variables. After computing the partial $F$ for each variable, the variables can then be ranked.

In the case of two groups, the partial $F$ is given by (5.32) as

$$F = (\nu - p + 1) \frac{T_p^2 - T_{p-1}^2}{\nu + T_{p-1}^2}, \tag{8.26}$$

where $T_p^2$ is the two-sample Hotelling $T^2$ with all $p$ variables, $T_{p-1}^2$ is the $T^2$-statistic with all variables except $y_r$, and $\nu = n_1 + n_2 - 2$. The $F$-statistic in (8.26) is distributed as $F_{1,\nu-p+1}$.

For the several-group case, the partial $\Lambda$ for $y_r$ adjusted for the other $p - 1$ variables is given by (6.128) as

$$\Lambda(y_r \mid y_1, \ldots, y_{r-1}, y_{r+1}, \ldots, y_p) = \frac{\Lambda_p}{\Lambda_{p-1}}, \tag{8.27}$$

where $\Lambda_p$ is Wilks' $\Lambda$ for all $p$ variables and $\Lambda_{p-1}$ involves all variables except $y_r$. The corresponding partial $F$ is given by (6.129) as

$$F = \frac{1 - \Lambda}{\Lambda} \, \frac{\nu_E - p + 1}{\nu_H}, \tag{8.28}$$

where $\Lambda$ is defined in (8.27), $\nu_E = N - k$, and $\nu_H = k - 1$. The partial $\Lambda$-statistic in (8.27) is distributed as $\Lambda_{1,\nu_H,\nu_E-p+1}$, and the partial $F$ in (8.28) is distributed as $F_{\nu_H,\nu_E-p+1}$.

The partial $F$-values in (8.26) and (8.28) are not associated with a single dimension of group separation, as are the standardized discriminant function coefficients. For example, $y_2$ will have a different contribution in each of the $s$ discriminant functions, but the partial $F$ for $y_2$ constitutes an overall index of the contribution of $y_2$ to group separation taking into account all dimensions. However, the partial $F$-values will often rank the variables in the same order as the standardized coefficients for the first discriminant function, especially if $\lambda_1 / \sum_j \lambda_j$ is very large so that the first function accounts for most of the available separation.

A partial index of association for $y_r$ similar to the overall measure for all $y$'s given in (6.41), $\eta_\Lambda^2 = 1 - \Lambda$, can be defined by

$$R_r^2 = 1 - \Lambda_r, \qquad r = 1, 2, \ldots, p, \tag{8.29}$$

where $\Lambda_r$ is the partial $\Lambda$ in (8.27) for $y_r$. This partial $R^2$ is a measure of association between the grouping variables and $y_r$ after adjusting for the other $p - 1$ $y$'s.
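
The partial statistics in (8.26) through (8.29) can be sketched as two small functions. All input values below ($T^2$ and $\Lambda$ values, sample sizes) are hypothetical numbers chosen only to exercise the formulas; they do not come from the football data, and the function names are ours.

```python
def partial_F_two_groups(T2_full, T2_reduced, n1, n2, p):
    """Partial F for one variable in the two-group case, following (8.26)."""
    nu = n1 + n2 - 2
    F = (nu - p + 1) * (T2_full - T2_reduced) / (nu + T2_reduced)
    return F, (1, nu - p + 1)              # statistic and its degrees of freedom

def partial_F_several_groups(wilks_full, wilks_reduced, N, k, p):
    """Partial Lambda (8.27), partial F (8.28), and partial R^2 (8.29)."""
    nu_E, nu_H = N - k, k - 1
    partial_lambda = wilks_full / wilks_reduced
    F = (1.0 - partial_lambda) / partial_lambda * (nu_E - p + 1) / nu_H
    partial_R2 = 1.0 - partial_lambda
    return partial_lambda, F, (nu_H, nu_E - p + 1), partial_R2

# Hypothetical inputs: T^2 with all p = 4 variables and with one variable removed.
F2, df2 = partial_F_two_groups(T2_full=50.0, T2_reduced=40.0, n1=20, n2=20, p=4)

# Hypothetical inputs: Wilks' Lambda with all p = 6 variables and with one removed.
lam_r, Fk, dfk, R2_r = partial_F_several_groups(0.30, 0.40, N=90, k=3, p=6)

print(f"two groups:     F = {F2:.3f} on {df2} df")
print(f"several groups: partial Lambda = {lam_r:.3f}, F = {Fk:.3f} on {dfk} df, R^2 = {R2_r:.3f}")
```

Repeating the several-group calculation for each $y_r$ in turn gives one partial $F$ per variable, which can then be ranked.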
8.7.3 Correlations between Variables and Discriminant Functions

Many textbooks and research papers assert that the best measure of variable importance is the correlation between each variable and a discriminant function, $r_{y_i z_j}$. It is
claimed  that  these  correlations  are  more  informative  than  standardized  coefficients 
with  respect  to  the  joint  contribution  of  the  variables  to  the  discriminant  functions. 
The  correlations  are  often  referred  to  as  loadings  or  structure  coefficients  and  are 
routinely  provided  in  many  major  programs.  However,  Rencher  (1988;  1992b;  1998, 
Section  5.7)  has  shown  that  the  correlations  in  question  show  the  contribution  of 
each  variable  in  a  univariate  context  rather  than  in  a  multivariate  one.  The  correla¬ 
tions  actually  reproduce  the  t  or  F  for  each  variable,  and  consequently  they  show 
only  how  each  variable  by  itself  separates  the  groups,  ignoring  the  presence  of  the 
other  variables.  Hence  these  correlations  provide  no  information  about  how  the  vari¬ 
ables  contribute  jointly  to  separation  of  the  groups.  They  become  misleading  if  used 
for  interpretation  of  discriminant  functions. 

Upon  reflection,  we  could  have  anticipated  this  failure  of  the  correlations  to  pro¬ 
vide  multivariate  information.  The  objection  to  standardized  coefficients  is  based 
on  the  argument  that  they  are  “unstable”  because  they  change  if  some  variables  are 
deleted  and  others  added.  However,  we  actually  want  them  to  behave  this  way,  so  as 
to  reflect  the  mutual  influence  of  the  variables  on  each  other.  In  a  multivariate  analy¬ 
sis,  interest  is  centered  on  the  joint  performance  of  the  set  of  variables  at  hand.  To  ask 
for  the  contribution  of  each  variable  independent  of  all  other  variables  is  to  request  a 
univariate index that ignores the other variables. The correlations $r_{y_i z_j}$ are stable and
do  not  change  when  variables  are  added  or  deleted;  this  should  be  a  clear  signal  that 
they  are  univariate  in  nature.  There  is  no  middle  ground  between  the  univariate  and 
multivariate  realms. 
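
The univariate nature of these correlations can be checked numerically in the simplest setting. The sketch below is limited to the two-group case and to pooled within-group correlations, and it uses synthetic data (an assumption, not data from the text). With $\mathbf{a} = \mathbf{S}_{\mathrm{pl}}^{-1}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)$, the covariance between $y_r$ and $z = \mathbf{a}'\mathbf{y}$ is simply the $r$th mean difference, so each correlation is a fixed multiple of the univariate $t$ for that variable.

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, p = 25, 30, 4

# Synthetic correlated data for two groups with a mean shift (hypothetical).
mix = rng.normal(size=(p, p))
y1 = rng.normal(size=(n1, p)) @ mix
y2 = rng.normal(size=(n2, p)) @ mix + np.array([1.0, 0.0, -0.5, 0.3])

d = y1.mean(axis=0) - y2.mean(axis=0)
c1, c2 = y1 - y1.mean(axis=0), y2 - y2.mean(axis=0)
S_pl = (c1.T @ c1 + c2.T @ c2) / (n1 + n2 - 2)
s = np.sqrt(np.diag(S_pl))

a = np.linalg.solve(S_pl, d)               # two-group discriminant function coefficients
a_star = s * a                             # standardized coefficients, as in (8.16)

# Pooled within-group correlation between each y_r and z = a'y.
r_yz = (S_pl @ a) / (s * np.sqrt(a @ S_pl @ a))

# Univariate two-sample t for each variable, ignoring the other variables.
t = d / (s * np.sqrt(1 / n1 + 1 / n2))

print("correlations:       ", np.round(r_yz, 3))
print("univariate t:       ", np.round(t, 3))
print("standardized coefs: ", np.round(a_star, 3))
```

In this sketch the correlations and the $t$-values rank the variables identically, whereas the standardized coefficients need not, since only the latter depend on the correlations among the variables.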


8.7.4  Rotation 

Rotation  of  the  discriminant  function  coefficients  is  sometimes  recommended.  This 
is  a  procedure  (see  Section  13.5)  that  attempts  to  produce  a  pattern  with  (absolute 
values  of)  coefficients  closer  to  0  or  1.  Discriminant  functions  with  such  coefficients 
might  be  easier  to  interpret,  but  they  have  two  deficiencies:  they  no  longer  maximize 
group  separation  and  they  are  correlated. 

Accordingly,  for  interpretation  of  discriminant  functions  we  recommend  stan¬ 
dardized  coefficients  rather  than  correlations  or  rotated  coefficients. 


8.8  SCATTER  PLOTS 

One benefit of the dimension reduction effected by discriminant analysis is the potential for plotting. It was noted in Section 6.2 that the number of large eigenvalues of $\mathbf{E}^{-1}\mathbf{H}$ reflects the dimensionality of the space occupied by the mean vectors. In many data sets, the first two discriminant functions account for most of $\lambda_1 + \lambda_2 + \cdots + \lambda_s$, and consequently the pattern of the mean vectors can be effectively portrayed in a two-dimensional plot. If the essential dimensionality is greater than 2, there may be some distortion in intergroup configuration in a two-dimensional plot; that is, some groups that overlap in two dimensions may be well separated in a third dimension.

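To plot the first two discriminant functions for the individual observation vectors
$\mathbf{y}_{ij}$, simply calculate $z_{1ij} = \mathbf{a}_1'\mathbf{y}_{ij}$ and $z_{2ij} = \mathbf{a}_2'\mathbf{y}_{ij}$ for $i = 1, 2, \ldots, k$; $j = 1,
2, \ldots, n_i$, and plot a scatter plot of the $N = \sum_i n_i$ values of

$$\mathbf{z}_{ij} = \begin{pmatrix} z_{1ij} \\ z_{2ij} \end{pmatrix} = \begin{pmatrix} \mathbf{a}_1'\mathbf{y}_{ij} \\ \mathbf{a}_2'\mathbf{y}_{ij} \end{pmatrix} = \mathbf{A}\mathbf{y}_{ij}. \tag{8.30}$$

The transformed mean vectors,

$$\bar{\mathbf{z}}_i = \begin{pmatrix} \bar{z}_{1i} \\ \bar{z}_{2i} \end{pmatrix} = \begin{pmatrix} \mathbf{a}_1'\bar{\mathbf{y}}_i \\ \mathbf{a}_2'\bar{\mathbf{y}}_i \end{pmatrix} = \mathbf{A}\bar{\mathbf{y}}_i, \qquad i = 1, 2, \ldots, k, \tag{8.31}$$

should be plotted along with the individual values, $\mathbf{z}_{ij}$. In some cases, a plot would
show only the transformed mean vectors $\bar{\mathbf{z}}_1, \bar{\mathbf{z}}_2, \ldots, \bar{\mathbf{z}}_k$. For confidence regions for
$\boldsymbol{\mu}_{z_i} = \mathbf{A}\boldsymbol{\mu}_i$, see Rencher (1998, Section 5.8).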

We note that the eigenvalues of $\mathbf{E}^{-1}\mathbf{H}$ reveal the dimensionality of the mean vectors,
not of the individual points. The dimensionality of the individual observations
is p, although the essential dimensionality may be less because the variables are correlated.
(The dimensionality of the observation vectors is the concern of principal
components; see Chapter 12.) If s = 2, for example, so that the mean vectors occupy
only two dimensions, the individual observation vectors ordinarily lie in more than
two dimensions, and their inclusion in a plot constitutes a projection onto the
two-dimensional plane of the mean vectors.

It was noted in Section 8.4.1 that the discriminant functions are uncorrelated but
not orthogonal. Thus the angle between $\mathbf{a}_1$ and $\mathbf{a}_2$ as given by (3.14) is not 90°
(that is, $\mathbf{a}_1'\mathbf{a}_2 \neq 0$). In practice, however, the usual procedure is to plot discriminant
functions on a rectangular coordinate system. The resulting distortion is generally
not serious.
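The projection in (8.30) and (8.31) can be sketched numerically. The following is a minimal illustration, not the author's code: it assumes NumPy, uses made-up data, and the function name `discriminant_scores` is hypothetical. It solves the eigenproblem for E^{-1}H through a Cholesky factor of E, which scales each coefficient vector so that a'Ea = 1; that scaling is a convenience of this sketch and may differ from the scaling used elsewhere in the text.

```python
import numpy as np

def discriminant_scores(Y, g, n_funcs=2):
    """Project observations and group means onto the leading discriminant functions."""
    labels = np.unique(g)
    grand = Y.mean(axis=0)
    p = Y.shape[1]
    E = np.zeros((p, p))   # within-group (error) SSCP matrix
    H = np.zeros((p, p))   # between-group (hypothesis) SSCP matrix
    means = []
    for lab in labels:
        Yi = Y[g == lab]
        m = Yi.mean(axis=0)
        means.append(m)
        d = Yi - m
        E += d.T @ d
        H += len(Yi) * np.outer(m - grand, m - grand)
    # Solve H a = lambda E a via E = L L', which keeps the eigenproblem symmetric.
    L = np.linalg.cholesky(E)
    M = np.linalg.solve(L, np.linalg.solve(L, H).T)   # L^{-1} H L^{-T}
    lam, V = np.linalg.eigh((M + M.T) / 2)
    order = np.argsort(lam)[::-1][:n_funcs]
    A = np.linalg.solve(L.T, V[:, order]).T           # rows are a_1', a_2'
    Z = Y @ A.T                                       # z_ij = A y_ij, one row per observation
    Zbar = np.array(means) @ A.T                      # transformed mean vectors
    return A, lam[order], Z, Zbar, E

# Hypothetical data: three groups of 20 observations on four correlated variables.
rng = np.random.default_rng(0)
g = np.repeat([0, 1, 2], 20)
centers = np.array([[0.0, 0.0, 0.0, 0.0],
                    [3.0, 1.0, 0.0, 0.0],
                    [4.0, 0.0, 2.0, 0.0]])
mix = np.array([[1.0, 0.8, 0.0, 0.0],
                [0.0, 1.0, 0.5, 0.0],
                [0.0, 0.0, 1.0, 0.3],
                [0.0, 0.0, 0.0, 1.0]])
Y = centers[g] + rng.normal(size=(60, 4)) @ mix
A, lam, Z, Zbar, E = discriminant_scores(Y, g)
print("a1'a2    =", A[0] @ A[1])       # generally nonzero: the vectors are not orthogonal
print("a1' E a2 =", A[0] @ E @ A[1])   # essentially zero: the scores are uncorrelated
```

The columns of `Z` and the rows of `Zbar` are the coordinates one would pass to any plotting routine, with the means drawn using a distinct symbol.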


Example 8.8. Figure 8.4 contains a scatter plot of $(z_1, z_2)$ for the observations in the
football data of Table 8.3. Each observation in group 1 is denoted by a square, observations
in group 2 are denoted by circles, and observations in group 3 are indicated
by triangles. We see that the first discriminant function $z_1$ (the horizontal direction)
effectively separates group 1 from groups 2 and 3, whereas the second discriminant
function $z_2$ (the vertical direction) is less successful in separating group 2 from
group 3.

The group mean vectors are indicated by solid circles. They are almost collinear,
as we would expect since $\lambda_1 = 1.92$ dominates $\lambda_2 = .12$. □

Figure 8.4. Scatter plot of discriminant function values for the football data of Table 8.3.


8.9  STEPWISE  SELECTION  OF  VARIABLES 

In  many  applications,  a  large  number  of  dependent  variables  is  available  and  the 
experimenter  would  like  to  discard  those  that  are  redundant  (in  the  presence  of  the 
other  variables)  for  separating  the  groups.  Our  discussion  is  limited  to  procedures 
that delete or add variables one at a time. We emphasize that we are selecting dependent
variables (y’s), and therefore the basic model (one-way MANOVA) does not
change.  In  subset  selection  in  regression,  on  the  other  hand,  we  select  independent 
variables  with  a  consequent  alteration  of  the  model. 

A forward selection method was discussed in Section 6.11.2. We begin with a single
variable, the one that maximally separates the groups by itself. Then the variable
entered at each step is the one that maximizes the partial F-statistic based on Wilks’
$\Lambda$, thus obtaining the maximal additional separation of groups above and beyond
the separation already attained by the other variables. Since we choose the variable
with maximum partial F at each step, the proportion of these maximum F’s that
exceed $F_\alpha$ is greater than $\alpha$. This bias is discussed in Rencher and Larson (1980) and
Rencher (1998, Section 5.10).
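The selection bias just described can be illustrated with a small simulation. This is only a sketch under stated assumptions: three groups of equal size, independent standard normal variables with no true group differences, and only the first step of forward selection (the largest univariate F). The function names and default settings are hypothetical. With three groups, $\nu_H = 2$, and the upper $\alpha$ point of $F_{2,\nu_E}$ has the closed form used below.

```python
import random

def max_univariate_f(groups):
    """Largest one-way ANOVA F over the variables; groups is a list of lists of rows."""
    k = len(groups)
    n_total = sum(len(rows) for rows in groups)
    p = len(groups[0][0])
    best = 0.0
    for v in range(p):
        cols = [[row[v] for row in rows] for rows in groups]
        means = [sum(c) / len(c) for c in cols]
        grand = sum(sum(c) for c in cols) / n_total
        ssh = sum(len(c) * (m - grand) ** 2 for c, m in zip(cols, means))
        sse = sum((x - m) ** 2 for c, m in zip(cols, means) for x in c)
        best = max(best, (ssh / (k - 1)) / (sse / (n_total - k)))
    return best

def exceed_rate(p=6, n=10, alpha=0.05, reps=2000, seed=1):
    """Share of null data sets whose maximum F exceeds F_alpha (three groups)."""
    k = 3                      # fixed so that nu_H = 2 and the closed form applies
    rng = random.Random(seed)
    nu_e = k * n - k
    # P(F > f) = (1 + 2f/nu_e)^(-nu_e/2) for F(2, nu_e); invert for the critical value.
    f_crit = (nu_e / 2) * (alpha ** (-2 / nu_e) - 1)
    hits = 0
    for _ in range(reps):
        groups = [[[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
                  for _ in range(k)]
        if max_univariate_f(groups) > f_crit:
            hits += 1
    return hits / reps

print("one variable :", exceed_rate(p=1))   # should be near alpha
print("best of six  :", exceed_rate(p=6))   # noticeably larger than alpha
```

For six independent variables the maximum F exceeds the nominal critical value with probability 1 - .95^6, about .26, so the second rate should land far above .05.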

Backward elimination is a similar operation in which we begin with all the variables
and then at each step, the variable that contributes least is deleted, as indicated
by the partial F.

Stepwise  selection  is  a  combination  of  the  forward  and  backward  approaches. 
Variables  are  added  one  at  a  time,  and  at  each  step,  the  variables  are  reexamined 
to  see  if  any  variable  that  entered  earlier  has  become  redundant  in  the  presence  of 
recently  added  variables.  The  procedure  stops  when  the  largest  partial  F  among  the 
variables  available  for  entry  fails  to  exceed  a  preset  threshold  value.  The  stepwise 
procedure  has  long  been  popular  with  practitioners.  Some  detail  about  the  steps  in 
this procedure was given in Section 6.11.2.




All the preceding procedures are commonly referred to as stepwise discriminant
analysis. However, as noted in Section 6.11.2, we are actually doing stepwise
MANOVA.  No  discriminant  functions  are  calculated  in  the  selection  process.  After 
the  subset  selection  is  completed,  we  can  calculate  discriminant  functions  for  the 
selected  variables.  We  could  also  use  the  variables  in  a  classification  analysis,  as 
described  in  Chapter  9. 
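The forward part of the procedure can be sketched as follows. This is an illustrative outline, not the algorithm of any particular package: it assumes NumPy, omits the removal check of the full stepwise method, and uses a hypothetical entry threshold `f_enter` on the partial F. The overall $\Lambda$ for a subset is computed as |E|/|E + H| on the corresponding submatrices, and the partial F uses $\nu_E$ minus the number of variables already entered.

```python
import numpy as np

def eh_matrices(Y, g):
    """One-way MANOVA error (E) and hypothesis (H) matrices."""
    grand = Y.mean(axis=0)
    p = Y.shape[1]
    E = np.zeros((p, p))
    H = np.zeros((p, p))
    for lab in np.unique(g):
        Yi = Y[g == lab]
        m = Yi.mean(axis=0)
        d = Yi - m
        E += d.T @ d
        H += len(Yi) * np.outer(m - grand, m - grand)
    return E, H

def wilks(E, H, idx):
    """Wilks' Lambda for the variables listed in idx (1.0 for the empty set)."""
    if not idx:
        return 1.0
    ix = np.ix_(idx, idx)
    return np.linalg.det(E[ix]) / np.linalg.det(E[ix] + H[ix])

def forward_select(Y, g, f_enter=4.0):
    """Enter variables one at a time by largest partial F until none reaches f_enter."""
    E, H = eh_matrices(Y, g)
    N, p = Y.shape
    k = len(np.unique(g))
    nu_e, nu_h = N - k, k - 1
    chosen, history = [], []
    while len(chosen) < p:
        base = wilks(E, H, chosen)
        best = None
        for r in range(p):
            if r in chosen:
                continue
            lam = wilks(E, H, chosen + [r]) / base            # partial Lambda
            f = (1 - lam) / lam * (nu_e - len(chosen)) / nu_h  # partial F
            if best is None or f > best[1]:
                best = (r, f, lam)
        if best[1] < f_enter:
            break
        chosen.append(best[0])
        history.append(best)
    return chosen, history

# Hypothetical data: only the variable in column 2 separates the three groups.
rng = np.random.default_rng(1)
g = np.repeat([0, 1, 2], 30)
Y = rng.normal(size=(90, 5))
Y[:, 2] += 3.0 * g
chosen, history = forward_select(Y, g)
print(chosen)
```

As the text notes, no discriminant functions are computed during selection; only $\Lambda$ values for nested subsets are compared.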


Example 8.9. We use the football data of Table 8.3 to illustrate the stepwise procedure
outlined in this section and in Section 6.11.2. At the first step, we carry out a
univariate F (using ordinary ANOVA) for each variable to determine which variable
best separates the three groups by itself:

Variable        F        p-Value
WDIM          2.550      .0839
CIRCUM        6.231      .0030
FBEYE         1.668      .1947
EYEHD        58.162      $1.11 \times 10^{-16}$
EARHD        22.427      $1.40 \times 10^{-8}$
JAW           4.511      .0137

Thus EYEHD is the first variable to “enter.” The Wilks $\Lambda$ value equivalent to $F =
58.162$ is $\Lambda(y_1) = .4279$ (see Table 6.1 with $p = 1$). At the second step we calculate
a partial $\Lambda$ and accompanying partial F using (8.27) and (8.28):

$$\Lambda(y_r \mid y_1) = \frac{\Lambda(y_1, y_r)}{\Lambda(y_1)}, \qquad F = \frac{1 - \Lambda(y_r \mid y_1)}{\Lambda(y_r \mid y_1)} \, \frac{\nu_E - 1}{\nu_H},$$

where $y_1$ indicates the variable selected at step 1 (EYEHD) and $y_r$ represents each
of the five variables to be examined at step 2. The results are


Variable     Partial $\Lambda$     Partial F     p-Value
WDIM           .9355               2.964        .0569
CIRCUM         .9997                .012        .9881
FBEYE          .9946                .235        .7911
EARHD          .9525               2.143        .1235
JAW            .9540               2.072        .1322

The variable WDIM would enter at this step, since it has the largest partial F. With a
p-value of .0569, entering this variable may be questionable, but we will continue the
procedure for illustrative purposes. We next check to see if EYEHD is still significant
now that WDIM has entered. The partial $\Lambda$ and F for EYEHD adjusted for WDIM
are $\Lambda = .424$ and $F = 58.47$. Thus EYEHD stays in. The overall Wilks’ $\Lambda$ for
EYEHD and WDIM is $\Lambda(y_1, y_2) = .4003$.

At step 3 we check each of the four remaining variables for possible entry using

$$\Lambda(y_r \mid y_1, y_2) = \frac{\Lambda(y_1, y_2, y_r)}{\Lambda(y_1, y_2)}, \qquad F = \frac{1 - \Lambda(y_r \mid y_1, y_2)}{\Lambda(y_r \mid y_1, y_2)} \, \frac{\nu_E - 2}{\nu_H},$$

where $y_1$ = EYEHD, $y_2$ = WDIM, and $y_r$ represents each of the other four variables.
The results are


Variable     Partial $\Lambda$     Partial F     p-Value
CIRCUM         .9774                .981        .3793
FBEYE          .9748               1.098        .3381
EARHD          .9292               3.239        .0441
JAW            .8451               7.791        .0008

The indicated variable for entry at this step is JAW. To determine whether one of the
first two should be removed after JAW has entered, we calculate the partial $\Lambda$ and F
for each, adjusted for the other two:

Variable     Partial $\Lambda$     Partial F     p-Value
WDIM           .8287               8.787        .0003
EYEHD          .4634              49.211        $6.33 \times 10^{-15}$

Thus both previously entered variables remain in the model. The overall Wilks $\Lambda$ for
EYEHD, WDIM, and JAW is $\Lambda(y_1, y_2, y_3) = .3383$.

At step 4 there are three candidate variables for entry. The partial $\Lambda$- and F-statistics
are

$$\Lambda(y_r \mid y_1, y_2, y_3) = \frac{\Lambda(y_1, y_2, y_3, y_r)}{\Lambda(y_1, y_2, y_3)}, \qquad F = \frac{1 - \Lambda(y_r \mid y_1, y_2, y_3)}{\Lambda(y_r \mid y_1, y_2, y_3)} \, \frac{\nu_E - 3}{\nu_H},$$

where $y_1$, $y_2$, and $y_3$ are the three variables already entered and $y_r$ represents each of
the other three remaining variables. The results are


Variable     Partial $\Lambda$     Partial F     p-Value
CIRCUM         .9987                .055        .9462
FBEYE          .9955                .189        .8282
EARHD          .9080               4.257        .0173

Hence EARHD enters at this step, and we check to see if any of the three previously
entered variables has now become redundant. The partial $\Lambda$ and partial F for
each of these three are

Variable     Partial $\Lambda$     Partial F     p-Value
WDIM           .7889              11.237        $4.74 \times 10^{-5}$
EYEHD          .6719              20.508        $5.59 \times 10^{-8}$
JAW            .8258               8.861        .0003

Consequently, all three variables are retained. The overall Wilks’ $\Lambda$ for all four
variables is now $\Lambda(y_1, y_2, y_3, y_4) = .3072$.

At step 5, the partial $\Lambda$- and F-values are

Variable     Partial $\Lambda$     Partial F     p-Value
CIRCUM         .9999                .003        .9971
FBEYE          .9999                .004        .9965

Thus  no  more  variables  will  enter. 

We  summarize  the  selection  process  as  follows: 


Step   Variable Entered   Overall $\Lambda$   Partial $\Lambda$   Partial F   p-Value
1      EYEHD                .4279              .4279             58.162     $1.11 \times 10^{-16}$
2      WDIM                 .4003              .9355              2.964     .0569
3      JAW                  .3383              .8451              7.791     .0008
4      EARHD                .3072              .9080              4.257     .0173
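The summary table can be checked with a few lines of arithmetic. The sketch below uses only the tabled values, plus one assumption that is not stated in this section: $\nu_E = 87$ and $\nu_H = 2$, which corresponds to three groups with 90 observations in total. Under that assumption, each overall $\Lambda$ is the running product of the partial $\Lambda$'s, and each partial F follows from its partial $\Lambda$.

```python
# Values copied from the summary table of Example 8.9.
partial_lambda = [0.4279, 0.9355, 0.8451, 0.9080]
overall_reported = [0.4279, 0.4003, 0.3383, 0.3072]
f_reported = [58.162, 2.964, 7.791, 4.257]

# Assumed degrees of freedom (not given in this section): 3 groups, N = 90.
nu_e, nu_h = 87, 2

# Overall Lambda after each step is the product of the partial Lambdas so far.
overall = []
running = 1.0
for lam in partial_lambda:
    running *= lam
    overall.append(running)

def partial_f(lam, n_already_in):
    """Partial F for a variable entering after n_already_in variables."""
    return (1 - lam) / lam * (nu_e - n_already_in) / nu_h

f_values = [partial_f(lam, i) for i, lam in enumerate(partial_lambda)]

for step, (o, f) in enumerate(zip(overall, f_values), start=1):
    print(step, round(o, 4), round(f, 3))
```

The recomputed values agree with the table to within the rounding of the four-decimal $\Lambda$'s.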

PROBLEMS 

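8.1 Show that if $\mathbf{a} = \mathbf{S}_{pl}^{-1}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)$ is substituted into $[\mathbf{a}'(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)]^2 / \mathbf{a}'\mathbf{S}_{pl}\mathbf{a}$, the
result is (8.3).

8.2 Verify (8.4) for the relationship between b and a.

8.3 Verify the relationship between $R^2$ and $T^2$ shown in (8.5).

8.4 Show that $[\mathbf{a}'(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)]^2 = \mathbf{a}'(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)'\mathbf{a}$ as in (8.7).

8.5 Show that $\mathbf{H}\mathbf{a} - \lambda\mathbf{E}\mathbf{a} = \mathbf{0}$ can be written in the form $(\mathbf{E}^{-1}\mathbf{H} - \lambda\mathbf{I})\mathbf{a} = \mathbf{0}$, as in
(8.12).

8.6 Verify (8.16) by substituting $a_r^* = s_r a_r$ into (8.15) to obtain $z_{1i} = a_1 y_{1i1} +
a_2 y_{1i2} + \cdots + a_p y_{1ip} - \mathbf{a}'\bar{\mathbf{y}}_1$.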

8.7 For the psychological data in Table 5.1, the discriminant function coefficient
vector was given in Example 5.5.

(a) Find the standardized coefficients.

(b) Calculate t-tests for the individual variables.

(c)  Compare  the  results  of  (a)  and  (b)  as  to  the  contribution  of  the  variables  to 
separation  of  the  two  groups. 

(d)  Find  the  partial  F  for  each  variable,  as  in  (8.26),  and  compare  with  the 
standardized  coefficients. 

8.8  Using  the  beetle  data  of  Table  5.5,  do  the  following: 

(a)  Find  the  discriminant  function  coefficient  vector. 

(b)  Find  the  standardized  coefficients. 

(c) Calculate t-tests for individual variables.

(d)  Compare  the  results  of  (b)  and  (c)  as  to  the  contribution  of  each  variable  to 
separation  of  the  groups. 

(e)  Find  the  partial  F  for  each  variable,  as  in  (8.26).  Do  the  partial  F’s  rank  the 
variables  in  the  same  order  of  importance  as  the  standardized  coefficients? 

8.9  Using  the  dystrophy  data  of  Table  5.7,  do  the  following: 

(a)  Find  the  discriminant  function  coefficient  vector. 

(b)  Find  the  standardized  coefficients. 

(c) Calculate t-tests for individual variables.

(d)  Compare  the  results  of  (b)  and  (c)  as  to  the  contribution  of  each  variable  to 
separation  of  the  groups. 

(e)  Find  the  partial  F  for  each  variable,  as  in  (8.26).  Do  the  partial  F’s  rank  the 
variables  in  the  same  order  of  importance  as  the  standardized  coefficients? 

8.10  For  the  cyclical  data  of  Table  5.8,  do  the  following: 

(a)  Find  the  discriminant  function  coefficient  vector. 

(b)  Find  the  standardized  coefficients. 

(c) Calculate t-tests for individual variables.

(d)  Compare  the  results  of  (b)  and  (c)  as  to  the  contribution  of  each  variable  to 
separation  of  the  groups. 

(e)  Find  the  partial  F  for  each  variable,  as  in  (8.26).  Do  the  partial  F’s  rank  the 
variables  in  the  same  order  of  importance  as  the  standardized  coefficients? 

8.11  Using  the  fish  data  in  Table  6.17,  do  the  following: 

(a) Find the eigenvectors of $\mathbf{E}^{-1}\mathbf{H}$.

(b) Carry out tests of significance for the discriminant functions and find the
relative importance of each as in (8.13), $\lambda_i / \sum_j \lambda_j$. Do these two procedures
agree as to the number of important discriminant functions?

(c) Find the standardized coefficients and comment on the contribution of the
variables to separation of groups.

(d) Find the partial F for each variable, as in (8.28). Do they rank the variables
in the same order as the standardized coefficients for the first discriminant
function?

(e) Plot the first two discriminant functions for each observation and for the
mean vectors.

8.12  For  the  rootstock  data  of  Table  6.2,  do  the  following: 

(a) Find the eigenvalues and eigenvectors of $\mathbf{E}^{-1}\mathbf{H}$.

(b) Carry out tests of significance for the discriminant functions and find the
relative importance of each as in (8.13), $\lambda_i / \sum_j \lambda_j$. Do these two procedures
agree as to the number of important discriminant functions?

(c)  Find  the  standardized  coefficients  and  comment  on  the  contribution  of  the 
variables  to  separation  of  groups. 

(d)  Find  the  partial  F  for  each  variable,  as  in  (8.28).  Do  they  rank  the  variables 
in  the  same  order  as  the  standardized  coefficients  for  the  first  discriminant 
function? 

(e) Plot the first two discriminant functions for each observation and for the
mean  vectors. 

8.13  Carry  out  a  stepwise  selection  of  variables  on  the  rootstock  data  of  Table  6.2. 

8.14  Carry  out  a  stepwise  selection  of  variables  on  the  engineer  data  of  Table  5.6. 

8.15  Carry  out  a  stepwise  selection  of  variables  on  the  fish  data  of  Table  6.17. 


CHAPTER  9 


Classification  Analysis:  Allocation  of 
Observations  to  Groups 


9.1  INTRODUCTION 

The descriptive aspect of discriminant analysis, in which group separation is characterized
by means of discriminant functions, was covered in Chapter 8. We turn now
to allocation of observations to groups, which is the predictive aspect of discriminant
analysis. We prefer to call this classification analysis to clearly distinguish it from the
descriptive aspect. However, classification is often referred to simply as discriminant
analysis. In engineering and computer science, classification is usually called pattern
recognition. Some writers use the term classification analysis to describe cluster
analysis, in which the observations are clustered according to variable values rather
than into predefined groups (see Chapter 14).

In classification, a sampling unit (subject or object) whose group membership is
unknown is assigned to a group on the basis of the vector of p measured values, y,
associated with the unit. To classify the unit, we must have available a previously
obtained sample of observation vectors from each group. Then one approach is to
compare y with the mean vectors $\bar{\mathbf{y}}_1, \bar{\mathbf{y}}_2, \ldots, \bar{\mathbf{y}}_k$ of the k samples and assign the unit
to the group whose $\bar{\mathbf{y}}_i$ is closest to y.

In this chapter, the term groups may refer to either the k samples or the k populations
from which they were taken. It should be clear from the context which of the
two uses is intended in every case.

We  give  some  examples  to  illustrate  the  classification  technique: 

1.  A  university  admissions  committee  wants  to  classify  applicants  as  likely  to 
succeed or likely to fail. The variables available are high school grades in various
subject areas, standardized test scores, rating of high school, number of
advanced  placement  courses,  etc. 

2.  A  psychiatrist  gives  a  battery  of  diagnostic  tests  in  order  to  assign  a  patient  to 
the  appropriate  mental  illness  category. 

3.  A  college  student  takes  aptitude  and  interest  tests  in  order  to  determine  which 
vocational area his or her profile best matches.

4. African, or “killer,” bees cannot be distinguished visually from ordinary
domestic  honey  bees.  Ten  variables  based  on  chromatograph  peaks  can  be 
used  to  readily  identify  them  (Lavine  and  Carlson  1987). 

5.  The  Air  Force  wishes  to  classify  each  applicant  into  the  training  program 
where  he  or  she  has  the  most  potential. 

6.  Twelve  of  the  Federalist  Papers  were  claimed  by  both  Madison  and  Hamilton. 
Can  we  identify  authorship  by  measuring  frequencies  of  word  usage  (Mosteller 
and  Wallace  1984)? 

7.  Variables  such  as  availability  of  fingerprints,  availability  of  eye  witnesses,  and 
time  until  police  arrive  can  be  used  to  classify  burglaries  into  solvable  and 
unsolvable. 

8.  One  approach  to  speech  recognition  by  computer  consists  of  an  attempt  to 
identify  phonemes  based  on  the  energy  levels  in  speech  waves. 

9.  A  number  of  variables  are  measured  at  five  weather  stations.  Based  on  these 
variables,  we  wish  to  predict  the  ceiling  at  a  particular  airport  in  2  hours.  The 
ceiling  categories  are  closed,  low  instrument,  high  instrument,  low  open,  and 
high  open  (Lachenbruch  1975,  p.  2). 


9.2  CLASSIFICATION  INTO  TWO  GROUPS 

In  the  case  of  two  populations,  we  have  a  sampling  unit  (subject  or  object)  to  be 
classified  into  one  of  two  populations.  The  information  we  have  available  consists  of 
the  p  variables  in  the  observation  vector  y  measured  on  the  sampling  unit.  In  the  first 
illustration  in  Section  9.1,  for  example,  we  have  an  applicant  with  high  school  grades 
and  various  test  scores  recorded  in  y.  We  do  not  know  if  the  applicant  will  succeed 
or  fail  at  the  university,  but  we  have  data  on  previous  students  at  the  university  for 
whom it is now known whether they succeeded or failed. By comparing y with $\bar{\mathbf{y}}_1$ for
those who succeeded and $\bar{\mathbf{y}}_2$ for those who failed, we attempt to predict the group to
which  the  applicant  will  eventually  belong. 

When there are two populations, we can use a classification procedure due to
Fisher (1936). The principal assumption for Fisher’s procedure is that the two populations
have the same covariance matrix ($\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2$). Normality is not required. We
obtain a sample from each of the two populations and compute $\bar{\mathbf{y}}_1$, $\bar{\mathbf{y}}_2$, and $\mathbf{S}_{pl}$. A
simple procedure for classification can be based on the discriminant function,

$$z = \mathbf{a}'\mathbf{y} = (\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)'\mathbf{S}_{pl}^{-1}\mathbf{y} \tag{9.1}$$

(see  Sections  5.5,  5.6,  8.2,  and  8.5),  where  y  is  the  vector  of  measurements  on  a  new 
sampling  unit  that  we  wish  to  classify  into  one  of  the  two  groups  (populations).  For 
convenience  we  speak  of  classifying  y  rather  than  classifying  the  subject  or  object 
associated  with  y. 

To determine whether y is closer to $\bar{\mathbf{y}}_1$ or $\bar{\mathbf{y}}_2$, we check to see if z in (9.1) is
closer to the transformed mean $\bar{z}_1$ or to $\bar{z}_2$. We evaluate (9.1) for each observation
$\mathbf{y}_{1i}$ from the first sample and obtain $z_{11}, z_{12}, \ldots, z_{1n_1}$, from which, by (3.54), $\bar{z}_1 =
\sum_{i=1}^{n_1} z_{1i}/n_1 = \mathbf{a}'\bar{\mathbf{y}}_1 = (\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)'\mathbf{S}_{pl}^{-1}\bar{\mathbf{y}}_1$. Similarly, $\bar{z}_2 = \mathbf{a}'\bar{\mathbf{y}}_2$. Denote the two
groups by $G_1$ and $G_2$. Fisher’s (1936) linear classification procedure assigns y to
$G_1$ if $z = \mathbf{a}'\mathbf{y}$ is closer to $\bar{z}_1$ than to $\bar{z}_2$ and assigns y to $G_2$ if z is closer to $\bar{z}_2$. This is
illustrated in Figure 9.1.

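For the configuration in Figure 9.1, we see that z is closer to $\bar{z}_1$ if

$$z > \tfrac{1}{2}(\bar{z}_1 + \bar{z}_2). \tag{9.2}$$

This is true in general because $\bar{z}_1$ is always greater than $\bar{z}_2$, which can easily be
shown as follows:

$$\bar{z}_1 - \bar{z}_2 = \mathbf{a}'(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2) = (\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)'\mathbf{S}_{pl}^{-1}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2) > 0, \tag{9.3}$$

because $\mathbf{S}_{pl}^{-1}$ is positive definite. Thus $\bar{z}_1 > \bar{z}_2$. [If a were of the form $\mathbf{a}' = (\bar{\mathbf{y}}_2 -
\bar{\mathbf{y}}_1)'\mathbf{S}_{pl}^{-1}$, then $\bar{z}_2 - \bar{z}_1$ would be positive.] Since $\tfrac{1}{2}(\bar{z}_1 + \bar{z}_2)$ is the midpoint, $z >
\tfrac{1}{2}(\bar{z}_1 + \bar{z}_2)$ implies that z is closer to $\bar{z}_1$. By (9.3) the distance from $\bar{z}_1$ to $\bar{z}_2$ is the
same as that from $\bar{\mathbf{y}}_1$ to $\bar{\mathbf{y}}_2$.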

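To express the classification rule in terms of y, we first write $\tfrac{1}{2}(\bar{z}_1 + \bar{z}_2)$ in the
form

$$\tfrac{1}{2}(\bar{z}_1 + \bar{z}_2) = \tfrac{1}{2}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)'\mathbf{S}_{pl}^{-1}(\bar{\mathbf{y}}_1 + \bar{\mathbf{y}}_2). \tag{9.4}$$

Then the classification rule becomes: Assign y to $G_1$ if

$$\mathbf{a}'\mathbf{y} = (\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)'\mathbf{S}_{pl}^{-1}\mathbf{y} > \tfrac{1}{2}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)'\mathbf{S}_{pl}^{-1}(\bar{\mathbf{y}}_1 + \bar{\mathbf{y}}_2) \tag{9.5}$$

and assign y to $G_2$ if

$$\mathbf{a}'\mathbf{y} = (\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)'\mathbf{S}_{pl}^{-1}\mathbf{y} < \tfrac{1}{2}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)'\mathbf{S}_{pl}^{-1}(\bar{\mathbf{y}}_1 + \bar{\mathbf{y}}_2). \tag{9.6}$$

Figure 9.1. Fisher’s procedure for classification into two groups.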

This linear classification rule employs the same discriminant function $z = \mathbf{a}'\mathbf{y}$
used in Section 8.2 in connection with descriptive separation of groups. Thus in the
two-group case, the discriminant function serves as a linear classification function as
well. However, in the several-group case in Section 9.3, we use classification functions
that are different from the descriptive discriminant functions in Section 8.4.
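The two-group rule in (9.5) and (9.6) can be sketched in a few lines. This is a minimal illustration assuming NumPy and made-up samples; the helper names `fisher_rule` and `classify` are hypothetical and not from the text.

```python
import numpy as np

def fisher_rule(Y1, Y2):
    """Return the coefficient vector a and the midpoint cutoff of (9.4)."""
    ybar1, ybar2 = Y1.mean(axis=0), Y2.mean(axis=0)
    n1, n2 = len(Y1), len(Y2)
    # Pooled covariance matrix from the two sample covariance matrices.
    S_pl = ((n1 - 1) * np.cov(Y1, rowvar=False)
            + (n2 - 1) * np.cov(Y2, rowvar=False)) / (n1 + n2 - 2)
    a = np.linalg.solve(S_pl, ybar1 - ybar2)      # a = S_pl^{-1}(ybar1 - ybar2)
    cutoff = 0.5 * a @ (ybar1 + ybar2)            # (1/2)(zbar1 + zbar2)
    return a, cutoff

def classify(y, a, cutoff):
    """Assign y to group 1 if a'y exceeds the midpoint, otherwise to group 2."""
    return 1 if a @ y > cutoff else 2

# Hypothetical samples from two populations with a common covariance structure.
rng = np.random.default_rng(2)
Y1 = rng.normal(size=(25, 3)) + np.array([2.0, 0.0, 1.0])
Y2 = rng.normal(size=(30, 3))
a, cutoff = fisher_rule(Y1, Y2)
print(classify(np.array([1.8, 0.1, 0.9]), a, cutoff))
```

Because $\bar{z}_1 - \bar{z}_2 > 0$ by (9.3), each sample mean vector is always assigned to its own group under this rule.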

Fisher’s  (1936)  approach  using  (9.5)  and  (9.6)  is  essentially  nonparametric 
because  no  distributional  assumptions  were  made.  However,  if  the  two  populations 
are  normal  with  equal  covariance  matrices,  then  this  method  is  (asymptotically) 
optimal;  that  is,  the  probability  of  misclassification  is  minimized  [see  comments 
following  (9.8)]. 

If prior probabilities $p_1$ and $p_2$ are known for the two populations, the classification
rule can be modified to exploit this additional information. We define the prior
probabilities as follows: $p_1$ is the proportion of observations in $G_1$ and $p_2$ is the
proportion in $G_2$, where $p_2 = 1 - p_1$. For example, suppose that at a certain university
70% of entering freshmen ultimately graduate. Then $p_1 = .7$ and $p_2 = .3$.

In order to use the prior probabilities, the density functions for the two populations,
$f(\mathbf{y}|G_1)$ and $f(\mathbf{y}|G_2)$, must also be known. Then the optimal classification
rule (Welch 1939) that minimizes the probability of misclassification is: Assign y to
$G_1$ if

$$p_1 f(\mathbf{y}|G_1) > p_2 f(\mathbf{y}|G_2) \tag{9.7}$$

and to $G_2$ otherwise. Note that $f(\mathbf{y}|G_1)$ is a convenient notation for the density when
sampling from the population represented by $G_1$. It does not represent a conditional
distribution in the usual sense (Section 4.2).

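Assuming that the two densities are multivariate normal with equal covariance
matrices, namely, $f(\mathbf{y}|G_1) = N_p(\boldsymbol{\mu}_1, \boldsymbol{\Sigma})$ and $f(\mathbf{y}|G_2) = N_p(\boldsymbol{\mu}_2, \boldsymbol{\Sigma})$, then from
(9.7) we obtain the following rule (with estimates in place of $\boldsymbol{\mu}_1$, $\boldsymbol{\mu}_2$, and $\boldsymbol{\Sigma}$): Assign
y to $G_1$ if

$$\mathbf{a}'\mathbf{y} = (\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)'\mathbf{S}_{pl}^{-1}\mathbf{y} > \tfrac{1}{2}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)'\mathbf{S}_{pl}^{-1}(\bar{\mathbf{y}}_1 + \bar{\mathbf{y}}_2) + \ln\left(\frac{p_2}{p_1}\right) \tag{9.8}$$

and to $G_2$ otherwise [see Rencher (1998, p. 231)]. Because we have substituted estimates
for the parameters, the rule in (9.8) is no longer optimal, as is (9.7). However,
it is asymptotically optimal (approaches optimality as the sample size increases).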
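The step from (9.7) to (9.8) can be outlined as follows. This is a brief sketch of the standard argument in terms of the population parameters, offered as a reading aid rather than as the derivation given in the cited reference. Taking logarithms in (9.7) gives $\ln[f(\mathbf{y}|G_1)/f(\mathbf{y}|G_2)] > \ln(p_2/p_1)$, and for two normal densities with common $\boldsymbol{\Sigma}$ the normalizing constants cancel:

$$\begin{aligned}
\ln\frac{f(\mathbf{y}|G_1)}{f(\mathbf{y}|G_2)}
&= -\tfrac{1}{2}(\mathbf{y} - \boldsymbol{\mu}_1)'\boldsymbol{\Sigma}^{-1}(\mathbf{y} - \boldsymbol{\mu}_1) + \tfrac{1}{2}(\mathbf{y} - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}^{-1}(\mathbf{y} - \boldsymbol{\mu}_2) \\
&= (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}^{-1}\mathbf{y} - \tfrac{1}{2}\left(\boldsymbol{\mu}_1'\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2'\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_2\right) \\
&= (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}^{-1}\mathbf{y} - \tfrac{1}{2}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2).
\end{aligned}$$

The quadratic term $\mathbf{y}'\boldsymbol{\Sigma}^{-1}\mathbf{y}$ cancels, which is why the rule is linear in y. Requiring this expression to exceed $\ln(p_2/p_1)$ and replacing $\boldsymbol{\mu}_1$, $\boldsymbol{\mu}_2$, $\boldsymbol{\Sigma}$ by $\bar{\mathbf{y}}_1$, $\bar{\mathbf{y}}_2$, $\mathbf{S}_{pl}$ yields (9.8).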

If $p_1 = p_2$, the normal-based classification rule in (9.8) becomes the same as
Fisher’s procedure given in (9.5) and (9.6). Thus Fisher’s rule, which is not based on
a normality assumption, has optimal properties when the data come from multivariate
normal populations with $\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2$ and $p_1 = p_2$. [For the case when $\boldsymbol{\Sigma}_1 \neq \boldsymbol{\Sigma}_2$,
see Rencher (1998, Section 6.2.2).] Hence, even though Fisher’s method is nonparametric,
it works better for normally distributed populations or other populations with
linear trends. For example, suppose two populations have 95% contours, as in Figure 9.2.
If the points are projected in any direction onto a straight line, there will be
almost total overlap. A linear discriminant procedure will not successfully separate
the two populations.

Figure 9.2. Two populations with nonlinear separation.


Example 9.2. For the psychological data of Table 5.1, $\bar{\mathbf{y}}_1$, $\bar{\mathbf{y}}_2$, and $\mathbf{S}_{pl}$ were obtained
in Example 5.4.2. The discriminant function coefficients were obtained in Example 5.5
as $\mathbf{a}' = (.5104, -.2032, .4660, -.3097)$. For $G_1$ (the male group), we find

$$\bar{z}_1 = \mathbf{a}'\bar{\mathbf{y}}_1 = .5104(15.97) - .2032(15.91) + .4660(27.19) - .3097(22.75) = 10.5427.$$

Similarly, for $G_2$ (the female group), $\bar{z}_2 = \mathbf{a}'\bar{\mathbf{y}}_2 = 4.4426$. Thus we assign an observation
vector y to $G_1$ if

$$z = \mathbf{a}'\mathbf{y} > \tfrac{1}{2}(\bar{z}_1 + \bar{z}_2) = 7.4927$$

and assign y to $G_2$ if $z < 7.4927$.

There are no new observations available, so we will illustrate the procedure
by classifying two of the observations in $G_1$. For $\mathbf{y}_{11}' = (15, 17, 24, 14)$, the first
observation in $G_1$, we have $z_{11} = \mathbf{a}'\mathbf{y}_{11} = .5104(15) - .2032(17) + .4660(24) -
.3097(14) = 11.0498$, which is greater than 7.4927, and $\mathbf{y}_{11}$ would be correctly
classified as belonging to $G_1$. For $\mathbf{y}_{14}' = (13, 12, 10, 16)$, the fourth observation in
$G_1$, we find $z_{14} = 3.9016$, which would misclassify $\mathbf{y}_{14}$ into $G_2$. □
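The arithmetic of Example 9.2 can be retraced directly from the numbers quoted there. The sketch below uses only the coefficient vector, the group 1 mean vector, the reported value of $\bar{z}_2$, and the two observations; the function names are hypothetical. Because the inputs are rounded, recomputed values may differ from the reported ones in the fourth decimal place.

```python
# Quantities quoted in Example 9.2.
a = [0.5104, -0.2032, 0.4660, -0.3097]   # discriminant function coefficients
ybar1 = [15.97, 15.91, 27.19, 22.75]     # mean vector of group 1
zbar2 = 4.4426                           # reported transformed mean of group 2

def score(y):
    """Discriminant function value z = a'y."""
    return sum(ai * yi for ai, yi in zip(a, y))

zbar1 = score(ybar1)
cutoff = 0.5 * (zbar1 + zbar2)           # midpoint between the transformed means

def assign(y):
    """Group 1 if z exceeds the midpoint, group 2 otherwise."""
    return 1 if score(y) > cutoff else 2

y11 = [15, 17, 24, 14]                   # first observation in group 1
y14 = [13, 12, 10, 16]                   # fourth observation in group 1
print(round(zbar1, 4), round(cutoff, 4))
print(round(score(y11), 4), assign(y11))
print(round(score(y14), 4), assign(y14))
```

The first observation falls on the group 1 side of the cutoff and the fourth on the group 2 side, matching the classification described in the example.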
9.3 CLASSIFICATION INTO SEVERAL GROUPS

In this section we discuss classification rules for several groups. As in the two-group
case, we use a sample from each of the k groups to find the sample mean vectors $\bar{\mathbf{y}}_1,
\bar{\mathbf{y}}_2, \ldots, \bar{\mathbf{y}}_k$. For a vector y whose group membership is unknown, one approach is to
use a distance function to find the mean vector that y is closest to and assign y to the
corresponding group.

9.3.1  Equal  Population  Covariance  Matrices:  Linear  Classification  Functions 

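In this section we assume $\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 = \cdots = \boldsymbol{\Sigma}_k$. We can estimate the common
population covariance matrix by a pooled sample covariance matrix

$$\mathbf{S}_{pl} = \frac{1}{N - k}\sum_{i=1}^{k}(n_i - 1)\mathbf{S}_i = \frac{\mathbf{E}}{N - k},$$

where $n_i$ and $\mathbf{S}_i$ are the sample size and covariance matrix of the ith group, E is the
error matrix from one-way MANOVA, and $N = \sum_i n_i$. We compare y to each $\bar{\mathbf{y}}_i$,
$i = 1, 2, \ldots, k$, by the distance function

$$D_i^2(\mathbf{y}) = (\mathbf{y} - \bar{\mathbf{y}}_i)'\mathbf{S}_{pl}^{-1}(\mathbf{y} - \bar{\mathbf{y}}_i) \tag{9.9}$$

and assign y to the group for which $D_i^2(\mathbf{y})$ is smallest.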

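We can obtain a linear classification rule by expanding (9.9):

$$\begin{aligned}
D_i^2(\mathbf{y}) &= \mathbf{y}'\mathbf{S}_{pl}^{-1}\mathbf{y} - \mathbf{y}'\mathbf{S}_{pl}^{-1}\bar{\mathbf{y}}_i - \bar{\mathbf{y}}_i'\mathbf{S}_{pl}^{-1}\mathbf{y} + \bar{\mathbf{y}}_i'\mathbf{S}_{pl}^{-1}\bar{\mathbf{y}}_i \\
&= \mathbf{y}'\mathbf{S}_{pl}^{-1}\mathbf{y} - 2\bar{\mathbf{y}}_i'\mathbf{S}_{pl}^{-1}\mathbf{y} + \bar{\mathbf{y}}_i'\mathbf{S}_{pl}^{-1}\bar{\mathbf{y}}_i.
\end{aligned}$$

The term $\mathbf{y}'\mathbf{S}_{pl}^{-1}\mathbf{y}$ on the right can be neglected since it is not a function of i and,
consequently, does not change from group to group. The second term is a linear
function of y, and the third does not involve y. We thus delete $\mathbf{y}'\mathbf{S}_{pl}^{-1}\mathbf{y}$ and obtain a
linear classification function, which we denote by $L_i(\mathbf{y})$. If we multiply by $-\tfrac{1}{2}$ to
agree with the rule based on the normal distribution and prior probabilities given in
(9.12), our linear classification rule becomes: Assign y to the group for which

$$L_i(\mathbf{y}) = \bar{\mathbf{y}}_i'\mathbf{S}_{pl}^{-1}\mathbf{y} - \tfrac{1}{2}\bar{\mathbf{y}}_i'\mathbf{S}_{pl}^{-1}\bar{\mathbf{y}}_i, \qquad i = 1, 2, \ldots, k \tag{9.10}$$

is a maximum (we reversed the sign when multiplying by $-\tfrac{1}{2}$). To highlight the
linearity of (9.10) as a function of y, we can express it as

$$L_i(\mathbf{y}) = \mathbf{c}_i'\mathbf{y} + c_{i0} = c_{i1}y_1 + c_{i2}y_2 + \cdots + c_{ip}y_p + c_{i0},$$

where $\mathbf{c}_i' = \bar{\mathbf{y}}_i'\mathbf{S}_{pl}^{-1}$ and $c_{i0} = -\tfrac{1}{2}\bar{\mathbf{y}}_i'\mathbf{S}_{pl}^{-1}\bar{\mathbf{y}}_i$. To assign y to a group using this procedure,
we calculate $\mathbf{c}_i$ and $c_{i0}$ for each of the k groups, evaluate $L_i(\mathbf{y})$, $i = 1, 2, \ldots, k$, and
allocate y to the group for which $L_i(\mathbf{y})$ is largest. This will be the same group for
which $D_i^2(\mathbf{y})$ in (9.9) is smallest, that is, the group whose mean vector $\bar{\mathbf{y}}_i$ is closest
to y.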
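The equivalence between the largest $L_i(\mathbf{y})$ and the smallest $D_i^2(\mathbf{y})$ can be checked numerically. The following sketch assumes NumPy; the mean vectors, the pooled covariance matrix, and the function names are hypothetical and chosen only for illustration.

```python
import numpy as np

def linear_classification_functions(means, S_pl):
    """Return C (row i is c_i') and c0 (entry i is c_i0) for L_i(y) = c_i'y + c_i0."""
    C = np.linalg.solve(S_pl, means.T).T           # row i is ybar_i' S_pl^{-1}
    c0 = -0.5 * np.einsum("ij,ij->i", C, means)    # -(1/2) ybar_i' S_pl^{-1} ybar_i
    return C, c0

def squared_distances(y, means, S_pl):
    """D_i^2(y) of (9.9) for every group."""
    d = means - y
    return np.einsum("ij,ij->i", d @ np.linalg.inv(S_pl), d)

# Hypothetical mean vectors for k = 3 groups on p = 2 variables.
means = np.array([[10.0, 5.0],
                  [12.0, 4.0],
                  [11.0, 7.0]])
S_pl = np.array([[2.0, 0.6],
                 [0.6, 1.0]])
C, c0 = linear_classification_functions(means, S_pl)

y = np.array([11.4, 4.6])
L = C @ y + c0
D2 = squared_distances(y, means, S_pl)
print("group by largest L_i   :", int(np.argmax(L)) + 1)
print("group by smallest D_i^2:", int(np.argmin(D2)) + 1)
```

The identity behind the agreement is $D_i^2(\mathbf{y}) = \mathbf{y}'\mathbf{S}_{pl}^{-1}\mathbf{y} - 2L_i(\mathbf{y})$, so ranking groups by $L_i$ reverses the ranking by $D_i^2$.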

For the case of several groups, the optimal rule in (9.7) extends to:

Assign y to the group for which $p_i f(\mathbf{y}|G_i)$ is maximum. (9.11)

With this rule, the probability of misclassification is minimized. If we assume normality
with equal covariance matrices and with prior probabilities of group membership,
$p_1, p_2, \ldots, p_k$, then $f(\mathbf{y}|G_i) = N_p(\boldsymbol{\mu}_i, \boldsymbol{\Sigma})$, and the rule in (9.11) becomes
(with estimates in place of parameters): Calculate

$$L_i'(\mathbf{y}) = \ln p_i + \bar{\mathbf{y}}_i'\mathbf{S}_{pl}^{-1}\mathbf{y} - \tfrac{1}{2}\bar{\mathbf{y}}_i'\mathbf{S}_{pl}^{-1}\bar{\mathbf{y}}_i, \qquad i = 1, 2, \ldots, k \tag{9.12}$$

and assign y to the group with maximum value of $L_i'(\mathbf{y})$. Note that if $p_1 = p_2 =
\cdots = p_k$, then (9.12), which optimizes the classification rate for the normal distribution,
reduces to (9.10), which was based on the heuristic approach of minimizing
the distance of y to $\bar{\mathbf{y}}_i$.
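The effect of the $\ln p_i$ term in (9.12) can be seen in a small numerical sketch. It assumes NumPy, and the mean vectors, covariance matrix, priors, and function name are all hypothetical. With equal priors the rule reduces to (9.10); with unequal priors a borderline observation can move to a more prevalent group.

```python
import numpy as np

def classify_with_priors(y, means, S_pl, priors):
    """Evaluate L_i'(y) of (9.12) for each group; return the 1-based group and the scores."""
    C = np.linalg.solve(S_pl, means.T).T                      # row i is ybar_i' S_pl^{-1}
    scores = np.log(priors) + C @ y - 0.5 * np.einsum("ij,ij->i", C, means)
    return int(np.argmax(scores)) + 1, scores

# Hypothetical setup: three groups, two variables, identity pooled covariance.
means = np.array([[0.0, 0.0],
                  [2.0, 0.0],
                  [0.0, 2.0]])
S_pl = np.eye(2)
y = np.array([0.9, 0.1])          # slightly closer to the first mean than to the second

equal_group, _ = classify_with_priors(y, means, S_pl, [1 / 3, 1 / 3, 1 / 3])
skewed_group, _ = classify_with_priors(y, means, S_pl, [0.1, 0.8, 0.1])
print(equal_group, skewed_group)
```

Here the observation goes to group 1 under equal priors but to group 2 when group 2 is given prior probability .8, because $\ln p_2 - \ln p_1$ outweighs the small difference in the linear terms.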

The linear functions $L_i(\mathbf{y})$ defined in (9.10) are called linear classification functions
(many writers refer to them as linear discriminant functions). They are different
from the linear discriminant functions in Sections 6.1.4, 6.4, and 8.4.1, whose coefficients
are eigenvectors of $\mathbf{E}^{-1}\mathbf{H}$. In fact, there will be k classification functions and
$s = \min(p, k - 1)$ discriminant functions, where k is the number of groups and p is
the number of variables. In many cases we do not need all s discriminant functions
to effectively describe group differences, whereas all k classification functions must
be used in assigning observations to groups.

Example 9.3.1. For the football data of Table 8.3, the mean vectors for the three
groups are as follows:

$\bar{\mathbf{y}}_1' = (15.2, 58.9, 20.1, 13.1, 14.7, 12.3)$,
$\bar{\mathbf{y}}_2' = (15.4, 57.4, 19.8, 10.1, 13.5, 11.9)$,
$\bar{\mathbf{y}}_3' = (15.6, 57.8, 19.8, 10.9, 13.7, 11.8)$.

Using these values of $\bar{\mathbf{y}}_i$ and the pooled covariance matrix $\mathbf{S}_{pl}$, given in Example 8.5,
the linear classification functions (9.10) become

$L_1(\mathbf{y}) = 7.6y_1 + 13.3y_2 + 4.2y_3 - 1.2y_4 + 14.6y_5 + 8.2y_6 - 641.1$,

$L_2(\mathbf{y}) = 10.2y_1 + 13.3y_2 + 4.2y_3 - 3.4y_4 + 13.2y_5 + 6.1y_6 - 608.0$,

$L_3(\mathbf{y}) = 10.9y_1 + 13.3y_2 + 4.1y_3 - 2.7y_4 + 13.1y_5 + 5.2y_6 - 614.6$.

We note that $y_2$ and $y_3$ have essentially the same coefficients in all three functions
and hence do not contribute to classification of $\mathbf{y}$. These same two variables were
eliminated in the stepwise discriminant analysis in Example 8.9.

We illustrate the use of these linear functions for the first and third observations
in group 1. For the first observation, $\mathbf{y}_{11}$, we obtain

$L_1(\mathbf{y}_{11}) = 7.6(13.5) + 13.3(57.2) + 4.2(19.5) - 1.2(12.5) + 14.6(14.0) + 8.2(11.0) - 641.1 = 582.124$,

$L_2(\mathbf{y}_{11}) = 10.2(13.5) + 13.3(57.2) + 4.2(19.5) - 3.4(12.5) + 13.2(14.0) + 6.1(11.0) - 608.0 = 578.099$,

$L_3(\mathbf{y}_{11}) = 10.9(13.5) + 13.3(57.2) + 4.1(19.5) - 2.7(12.5) + 13.1(14.0) + 5.2(11.0) - 614.6 = 578.760$.

We classify $\mathbf{y}_{11}$ into group 1 since $L_1(\mathbf{y}_{11}) = 582.1$ exceeds $L_2(\mathbf{y}_{11})$ and $L_3(\mathbf{y}_{11})$.
For the third observation in group 1, $\mathbf{y}_{13}$, we obtain

$L_1(\mathbf{y}_{13}) = 567.054, \quad L_2(\mathbf{y}_{13}) = 570.290, \quad L_3(\mathbf{y}_{13}) = 569.137.$

This observation is misclassified into group 2 since $L_2(\mathbf{y}_{13}) = 570.290$ exceeds
$L_1(\mathbf{y}_{13})$ and $L_3(\mathbf{y}_{13})$. □
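
The evaluation for the first observation can be reproduced approximately in Python. The sketch below uses only the coefficients as printed in this example, which are rounded to one decimal place, so the three values it computes differ by a point or two from the printed ones (presumably obtained from unrounded coefficients); the assignment to group 1 is unchanged. The array names are hypothetical.

```python
import numpy as np

# Rounded coefficients and constants of L_1, L_2, L_3 as printed in Example 9.3.1.
coef = np.array([
    [7.6, 13.3, 4.2, -1.2, 14.6, 8.2],
    [10.2, 13.3, 4.2, -3.4, 13.2, 6.1],
    [10.9, 13.3, 4.1, -2.7, 13.1, 5.2],
])
const = np.array([-641.1, -608.0, -614.6])

# First observation in group 1, as substituted in the example.
y11 = np.array([13.5, 57.2, 19.5, 12.5, 14.0, 11.0])

scores = coef @ y11 + const                 # one value per group
assigned_group = int(np.argmax(scores)) + 1 # groups are numbered from 1
print(scores, assigned_group)
```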

9.3.2  Unequal  Population  Covariance  Matrices:  Quadratic 
Classification  Functions 

The linear classification functions in Section 9.3.1 are based on the assumption
$\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 = \cdots = \boldsymbol{\Sigma}_k$. The resulting classification rules are sensitive to heterogeneity
of covariance matrices. Observations tend to be classified too frequently into
groups whose covariance matrices have larger variances on the diagonal. Thus the
population covariance matrices should not be assumed to be equal if there is reason
to suspect otherwise.

If $\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 = \cdots = \boldsymbol{\Sigma}_k$ does not hold, the classification rules can easily be
altered to preserve optimality of classification rates. In place of (9.9), we can use

$$D_i^2(\mathbf{y}) = (\mathbf{y} - \bar{\mathbf{y}}_i)' \mathbf{S}_i^{-1} (\mathbf{y} - \bar{\mathbf{y}}_i), \quad i = 1, 2, \ldots, k, \tag{9.13}$$

where $\mathbf{S}_i$ is the sample covariance matrix for the $i$th group. As before, we would
assign $\mathbf{y}$ to the group for which $D_i^2(\mathbf{y})$ is smallest. With $\mathbf{S}_i$ in place of $\mathbf{S}_{pl}$, (9.13) cannot
be reduced to a linear function of $\mathbf{y}$ as in (9.10) but remains a quadratic function.
Hence rules based on $\mathbf{S}_i$ are called quadratic classification rules.

If we assume normality with unequal covariance matrices and with prior probabilities
$p_1, p_2, \ldots, p_k$, then $f(\mathbf{y} \mid G_i) = N_p(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$, and the optimal rule in (9.11)
based on $p_i f(\mathbf{y} \mid G_i)$ becomes: Assign $\mathbf{y}$ to the group for which

$$Q_i(\mathbf{y}) = \ln p_i - \tfrac{1}{2} \ln |\mathbf{S}_i| - \tfrac{1}{2} (\mathbf{y} - \bar{\mathbf{y}}_i)' \mathbf{S}_i^{-1} (\mathbf{y} - \bar{\mathbf{y}}_i) \tag{9.14}$$

is maximum. If $p_1 = p_2 = \cdots = p_k$ or if the $p_i$'s are unknown, the term $\ln p_i$ is
deleted.

In order to use a quadratic classification rule based on $\mathbf{S}_i$, each $n_i$ must be greater
than $p$ so that $\mathbf{S}_i^{-1}$ will exist. This restriction does not apply to linear classification
rules based on $\mathbf{S}_{pl}$. Since more parameters are estimated with quadratic classification
functions, larger values of the $n_i$'s are needed for stability of estimates. Note the
distinction between $p$, the number of variables, and $p_i$, the prior probability for the
$i$th group.
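
A minimal Python sketch of the quadratic rule in (9.14) follows. The function name, the NumPy calls, and the two hypothetical groups (same mean vector, different covariance matrices) are assumptions made for illustration only; they are chosen to show a case that no linear rule could separate.

```python
import numpy as np

def quadratic_classification_scores(y, means, covs, priors=None):
    """Return Q_i(y) of (9.14); the ln p_i term is dropped when priors is None."""
    scores = []
    for i, (mean, s_i) in enumerate(zip(means, covs)):
        d = np.asarray(y, dtype=float) - np.asarray(mean, dtype=float)
        _, logdet = np.linalg.slogdet(s_i)            # ln |S_i|
        q = -0.5 * logdet - 0.5 * d @ np.linalg.solve(s_i, d)
        if priors is not None:
            q += np.log(priors[i])
        scores.append(q)
    return np.array(scores)

# Hypothetical groups with the same mean but different spread.
means = [np.zeros(2), np.zeros(2)]
covs = [np.eye(2), 9.0 * np.eye(2)]

near = quadratic_classification_scores([0.5, 0.0], means, covs)
far = quadratic_classification_scores([4.0, 0.0], means, covs)
group_near = int(np.argmax(near)) + 1
group_far = int(np.argmax(far)) + 1
print(group_near, group_far)
```

A point close to the common mean is assigned to the tightly concentrated group, while a distant point is assigned to the more dispersed group; both the $\ln |\mathbf{S}_i|$ term and the quadratic form contribute to this.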


9.4  ESTIMATING  MISCLASSIFICATION  RATES 

In Chapter 8, we assessed the effectiveness of the discriminant functions in group
separation by the use of significance tests or by examining $\lambda_i / \sum_j \lambda_j$. To judge the
ability of classification procedures to predict group membership, we usually use the
probability of misclassification, which is known as the error rate. We could also use
its complement, the correct classification rate.

A simple estimate of the error rate can be obtained by trying out the classification
procedure on the same data set that has been used to compute the classification
functions. This method is commonly referred to as resubstitution. Each observation
vector $\mathbf{y}_{ij}$ is submitted to the classification functions and assigned to a group. We
then count the number of correct classifications and the number of misclassifications.
The proportion of misclassifications resulting from resubstitution is called the
apparent error rate. The results can be conveniently displayed in a classification
table or confusion matrix, such as Table 9.1 for two groups.

Among the $n_1$ observations in $G_1$, $n_{11}$ are correctly classified into $G_1$, and $n_{12}$
are misclassified into $G_2$, where $n_1 = n_{11} + n_{12}$. Similarly, of the $n_2$ observations
in $G_2$, $n_{21}$ are misclassified into $G_1$, and $n_{22}$ are correctly classified into $G_2$, where
$n_2 = n_{21} + n_{22}$. Thus


$$\text{Apparent error rate} = \frac{n_{12} + n_{21}}{n_1 + n_2} = \frac{n_{12} + n_{21}}{n_{11} + n_{12} + n_{21} + n_{22}}. \tag{9.15}$$

Similarly, we can define

$$\text{Apparent correct classification rate} = \frac{n_{11} + n_{22}}{n_1 + n_2}. \tag{9.16}$$


Table 9.1. Classification Table for Two Groups

Actual    Number of        Predicted Group
Group     Observations       1          2
  1         $n_1$         $n_{11}$   $n_{12}$
  2         $n_2$         $n_{21}$   $n_{22}$


Table 9.2. Classification Table for the Psychological Data
of Table 5.1

Actual    Number of        Predicted Group
Group     Observations       1          2
Male          32            28          4
Female        32             4         28


Clearly,

Apparent error rate = 1 - apparent correct classification rate.

The  method  of  resubstitution  can  be  readily  extended  to  the  case  of  several  groups. 
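
The rates in (9.15) and (9.16) are simple functions of the classification table, and the same arithmetic applies to any number of groups. The following Python sketch, with a hypothetical helper `apparent_rates`, uses the counts of Table 9.2 as input; it assumes rows index the actual group and columns the predicted group.

```python
import numpy as np

def apparent_rates(table):
    """Return (apparent error rate, apparent correct classification rate)."""
    table = np.asarray(table, dtype=float)
    total = table.sum()
    correct = np.trace(table)          # n_11 + n_22 + ... on the diagonal
    return (total - correct) / total, correct / total

# Counts from Table 9.2 (rows: actual group, columns: predicted group).
table_9_2 = [[28, 4],
             [4, 28]]
error_rate, correct_rate = apparent_rates(table_9_2)
print(error_rate, correct_rate)
```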

The  apparent  error  rate  is  easily  obtained  and  is  routinely  provided  by  most  clas¬ 
sification  software  programs.  It  is  an  estimate  of  the  probability  that  our  classifi¬ 
cation  functions  based  on  the  present  sample  will  misclassify  a  future  observation. 
This  probability  is  called  the  actual  error  rate.  Unfortunately,  the  apparent  error  rate 
underestimates  the  actual  error  rate  because  the  data  set  used  to  compute  the  classi¬ 
fication  functions  is  also  used  to  evaluate  them.  The  classification  functions  are  opti¬ 
mized  for  the  particular  sample  and  may  be  capitalizing  on  chance  to  some  degree, 
especially  for  small  samples.  For  other  estimates  of  error  rates,  see  Rencher  (1998, 
Section  6.4).  In  Section  9.5  we  consider  some  approaches  to  reducing  the  bias  in  the 
apparent  error  rate. 

Example 9.4.(a). We use the psychological data of Table 5.1 to illustrate the apparent
error rate obtained by the resubstitution method for two groups. The hypothesis
$H_0: \boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2$ was not rejected in Example 7.3.2, and we therefore classify each of
the 64 observations using the linear classification procedure obtained in Example 9.2:
Classify as $G_1$ if $\mathbf{a}'\mathbf{y} > 7.4927$ and as $G_2$ otherwise. The resulting classification table
is given in Table 9.2. By (9.15),

$$\text{Apparent error rate} = \frac{n_{12} + n_{21}}{n_1 + n_2} = \frac{4 + 4}{32 + 32} = .125.$$

□


Example  9.4.(b).  We  use  the  football  data  of  Table  8.3  to  illustrate  the  use  of  the 
resubstitution  method  for  estimating  the  error  rate  in  the  case  of  more  than  two 
groups.  The  sample  covariance  matrices  for  the  three  groups  are  almost  significantly 
different,  and  we  will  use  both  linear  and  quadratic  classification  functions. 

The linear classification functions $L_i(\mathbf{y})$ from (9.10) were given in Example 9.3.1
for the football data. Using these, we classify each of the 90 observations. The results
are shown in Table 9.3.

Table 9.3. Classification Table for the Football Data of
Table 8.3 Using Linear Classification Functions

Actual    Number of          Predicted Group
Group     Observations       1      2      3
  1           30            26      1      3
  2           30             1     20      9
  3           30             2      8     20

Apparent correct classification rate = (26 + 20 + 20)/90 = .733
Apparent error rate = 1 - .733 = .267

An examination of this data set in Example 8.8 showed that groups 2 and 3 are
harder to separate than 1 and 2 or 1 and 3. This pattern is reflected here in the
misclassifications. Only 4 of the observation vectors in group 1 are misclassified, whereas
10 observations in each of groups 2 and 3 are misclassified.

Using the quadratic classification functions $Q_i(\mathbf{y})$, $i = 1, 2, 3$, in (9.14) and
assuming $p_1 = p_2 = p_3$, we obtain the classification results in Table 9.4. There is
some improvement in the apparent error rate using quadratic classification functions.

Table 9.4. Classification Table for the Football Data of
Table 8.3 Using Quadratic Classification Functions

Actual    Number of          Predicted Group
Group     Observations       1      2      3
  1           30            27      1      2
  2           30             2     21      7
  3           30             1      4     25

Apparent correct classification rate = (27 + 21 + 25)/90 = .811
Apparent error rate = 1 - .811 = .189

□ 


9.5  IMPROVED  ESTIMATES  OF  ERROR  RATES 

For large samples, the apparent error rate has only a small amount of bias for estimating
the actual error rate and can be used with little concern. For small samples,
however, it is overly optimistic (biased downward), as noted before. We discuss two
techniques for reducing the bias in the apparent error rate, that is, increasing the
apparent error rate to a more realistic level.


9.5.1 Partitioning the Sample

One  way  to  avoid  bias  is  to  split  the  sample  into  two  parts,  a  training  sample  used  to 
construct  the  classification  rule  and  a  validation  sample  used  to  evaluate  it.  With 
the  training  sample,  we  calculate  linear  or  quadratic  classification  functions.  We 
then  submit  each  observation  vector  in  the  validation  sample  to  the  classification 
functions  obtained  from  the  training  sample.  Since  these  observations  are  not  used 
in  calculating  the  classification  functions,  the  resulting  error  rate  is  unbiased.  To 
increase  the  information  gained,  we  could  also  reverse  the  roles  of  the  two  sam¬ 
ples  so  that  the  classification  functions  are  obtained  from  the  validation  sample  and 
evaluated  on  the  training  sample.  The  two  estimates  of  error  could  then  be  aver¬ 
aged. 

Partitioning  the  sample  has  at  least  two  disadvantages: 


1.  It  requires  large  samples  that  may  not  be  available. 

2.  It  does  not  evaluate  the  classification  function  we  will  use  in  practice.  The  esti¬ 
mate  of  error  based  on  half  the  sample  may  vary  considerably  from  that  based 
on  the  entire  sample.  We  prefer  to  use  all  or  almost  all  the  data  to  construct 
the  classification  functions  so  as  to  minimize  the  variance  of  our  error  rate 
estimate. 
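
The partitioning procedure, including the reversal of roles, can be sketched as follows. Everything specific here is an assumption for illustration: the data are simulated (two well-separated normal groups), the helper names are hypothetical, the split is a simple random half-and-half partition rather than one balanced within groups, and the linear classification functions of (9.10) are used.

```python
import numpy as np

def fit_linear_rule(X, g):
    """Group labels, mean vectors, and pooled covariance matrix from training data."""
    labels = np.unique(g)
    means = np.array([X[g == k].mean(axis=0) for k in labels])
    resid = X - means[np.searchsorted(labels, g)]
    s_pl = resid.T @ resid / (len(X) - len(labels))
    return labels, means, s_pl

def classify(X, labels, means, s_pl):
    """Assign each row of X to the group with the largest L_i(y) in (9.10)."""
    a = np.linalg.solve(s_pl, means.T)                  # columns: S_pl^{-1} ybar_i
    scores = X @ a - 0.5 * np.sum(means.T * a, axis=0)
    return labels[np.argmax(scores, axis=1)]

# Simulated (hypothetical) data: two groups of 40 observations on 2 variables.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (40, 2)), rng.normal(4.0, 1.0, (40, 2))])
g = np.repeat([1, 2], 40)

idx = rng.permutation(len(X))
train, valid = idx[:40], idx[40:]

def error_rate(fit_idx, eval_idx):
    rule = fit_linear_rule(X[fit_idx], g[fit_idx])
    return float(np.mean(classify(X[eval_idx], *rule) != g[eval_idx]))

e_forward = error_rate(train, valid)    # build on training, evaluate on validation
e_reverse = error_rate(valid, train)    # roles of the two samples reversed
e_average = (e_forward + e_reverse) / 2
print(e_forward, e_reverse, e_average)
```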


9.5.2  Holdout  Method 

The holdout method is an improved version of the sample-splitting procedure in
Section 9.5.1. In the holdout procedure, all but one of the observations are used to compute the
classification rule, and this rule is then used to classify the omitted observation. We
repeat this procedure for each observation, so that, in a sample of size $N = \sum_i n_i$,
each observation is classified by a function based on the other $N - 1$ observations.
The computational load is increased because $N$ distinct classification procedures have
to be constructed. The holdout procedure is also referred to as the leaving-one-out
method or as cross validation. Note that this procedure is used to estimate error
rates. The actual classification rule for future observations would be based on all $N$
observations.
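
A direct, unoptimized Python sketch of the holdout procedure is given below. It refits the linear classification functions of (9.10) once per omitted observation, exactly as described, and makes no attempt at the computational shortcuts that software may use. The simulated three-group data and the helper names are hypothetical.

```python
import numpy as np

def linear_rule_predict(X_train, g_train, y):
    """Classify one vector y with linear functions (9.10) built from the training data."""
    labels = np.unique(g_train)
    means = np.array([X_train[g_train == k].mean(axis=0) for k in labels])
    resid = X_train - means[np.searchsorted(labels, g_train)]
    s_pl = resid.T @ resid / (len(X_train) - len(labels))
    a = np.linalg.solve(s_pl, means.T)                  # columns: S_pl^{-1} ybar_i
    scores = y @ a - 0.5 * np.sum(means.T * a, axis=0)
    return labels[np.argmax(scores)]

def holdout_error_rate(X, g):
    """Leave each observation out in turn and classify it from the other N - 1."""
    n = len(X)
    wrong = 0
    for j in range(n):
        keep = np.arange(n) != j
        wrong += int(linear_rule_predict(X[keep], g[keep], X[j]) != g[j])
    return wrong / n

# Simulated (hypothetical) data: three groups of 15 observations on 2 variables.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 1.0, (15, 2)) for m in (0.0, 4.0, 8.0)])
g = np.repeat([1, 2, 3], 15)

rate = holdout_error_rate(X, g)
print(rate)
```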


Example 9.5.2. We use the football data of Table 8.3 to illustrate the holdout method
for estimating the error rate. Each of the 90 observations is classified by linear
classification functions based on the other 89 observations. To begin the procedure, the
first observation in group 1 ($\mathbf{y}_{11}$) is held out and the linear classification functions
$L_i(\mathbf{y})$, $i = 1, 2, 3$, in (9.10) are calculated using the remaining 29 observations in
group 1 and the 60 observations in groups 2 and 3. The observation $\mathbf{y}_{11}$ is now
classified using $L_1(\mathbf{y})$, $L_2(\mathbf{y})$, and $L_3(\mathbf{y})$. Then $\mathbf{y}_{11}$ is reinserted in group 1, and $\mathbf{y}_{12}$ is
held out. The functions $L_1(\mathbf{y})$, $L_2(\mathbf{y})$, and $L_3(\mathbf{y})$ are recomputed and $\mathbf{y}_{12}$ is then
classified. This procedure is followed for each of the 90 observations, and the results are
in Table 9.5.

Table 9.5. Classification Table for the Football Data of
Table 8.3 Using the Holdout Method Based on Linear
Classification Functions

Actual    Number of          Predicted Group
Group     Observations       1      2      3
  1           30            26      1      3
  2           30             1     18     11
  3           30             2      9     19

Correct classification rate = (26 + 18 + 19)/90 = .700
Error rate = 1 - .700 = .300


As  expected,  the  holdout  error  rate  has  increased  somewhat  from  the  apparent 
error  rate  based  on  resubstitution  in  Tables  9.3  and  9.4  in  Example  9.4.(b).  An  error 
rate  of  .300  is  a  less  optimistic  (more  realistic)  estimate  of  what  the  classification 
functions  can  do  with  future  samples.  □ 


9.6  SUBSET  SELECTION 

The  experimenter  often  has  available  a  large  number  of  variables  and  wishes  to  keep 
any  that  might  aid  in  predicting  group  membership  but  at  the  same  time  to  delete  any 
superfluous  variables  that  do  not  contribute  to  allocation.  A  reduction  in  the  number 
of  redundant  variables  may  in  fact  lead  to  improved  error  rates.  As  an  additional  con¬ 
sideration,  there  is  an  increase  in  robustness  to  nonnormality  of  linear  and  quadratic 
classification  functions  as  p  (the  number  of  variables)  decreases. 

The majority of selection schemes for classification analysis are based on stepwise
discriminant analysis or a similar approach (Section 8.9). One finds the subset
of variables that best separates groups using Wilks' $\Lambda$, for example, and then uses
these variables to construct classification functions. Most of the major statistical
software packages offer this method. When the "best" subset is selected in this way, an
optimistic bias in error rates is introduced. For a discussion of this bias, see Rencher
(1992a; 1998, Section 6.7).

Another  link  between  separation  and  classification  is  the  use  of  error  rates  in  an 
informal  stopping  rule  in  a  stepwise  discriminant  analysis.  Thus,  for  example,  if  a 
subset  of  5  variables  out  of  10  gives  a  misclassification  rate  of  33%  compared  to 
30%  for  the  full  set  of  variables,  we  may  decide  that  the  5  variables  are  adequate  for 
separating  the  groups.  We  could  try  several  subsets  of  decreasing  sizes  to  see  when 
the  error  rate  begins  to  escalate  noticeably. 

Example 9.6.(a). In Example 8.9, a stepwise discriminant analysis based on a partial
Wilks' $\Lambda$ (or partial $F$) was carried out for the football data of Table 8.3. Four
variables were selected: EYEHD, WDIM, JAW, and EARHD. These same four variables
are indicated by the coefficients in the linear classification functions in Example 9.3.1.
We now use these four variables to classify the observations using the
method of resubstitution to obtain the apparent error rate.

The linear classification functions (9.10) are

Group 1: $L_1(\mathbf{y}) = \bar{\mathbf{y}}_1' \mathbf{S}_{pl}^{-1} \mathbf{y} - \tfrac{1}{2} \bar{\mathbf{y}}_1' \mathbf{S}_{pl}^{-1} \bar{\mathbf{y}}_1$

$= 18.67y_1 + 4.13y_2 + 17.67y_3 + 20.44y_4 - 425.50$,

Group 2: $L_2(\mathbf{y}) = 21.13y_1 + 1.96y_2 + 16.24y_3 + 18.36y_4 - 392.75$,

Group 3: $L_3(\mathbf{y}) = 21.87y_1 + 2.67y_2 + 16.13y_3 + 17.46y_4 - 399.63$.

When each observation vector is classified using these linear functions, we obtain
the classification results in Table 9.6.

Table 9.6. Classification Table for the Football Data of
Table 8.3 Using Linear Classification Functions Based on
Four Variables Chosen by Stepwise Selection

Actual    Number of          Predicted Group
Group     Observations       1      2      3
  1           30            26      1      3
  2           30             1     20      9
  3           30             2      8     20

Table 9.6 is identical to Table 9.3 in Example 9.4.(b), where all six variables
were used. Thus the four selected variables can classify the sample as well as all six
variables classify it. □

Example  9.6.(b).  We  illustrate  the  use  of  error  rates  as  an  informal  stopping  rule  in 
a  stepwise  discriminant  analysis.  Fifteen  teacher  and  pupil  behaviors  were  observed 
during  5-min  intervals  of  reading  instruction  in  elementary  school  classrooms 
(Rencher,  Wadham,  and  Young  1978).  The  observations  were  recorded  in  rate  of 
occurrences  per  minute  for  each  variable.  The  variables  were  the  following: 

Teacher Behaviors

1. Explains: Explains task to learner.

2. Models: Models the task response for the learner.

3. Questions: Asks a question to elicit a task response.

4. Directs: Gives a direct signal to elicit a task response.

5. Controls: Controls management behavior with direction statements or gestures.

6. Positive: Gives a positive (affirmative) statement or gesture.

7. Negative: Gives a negative statement or gesture.

Pupil Behaviors

8. Overt delayed: An overt learner response to task signals that cannot be
judged correct or incorrect until later.

9. Correct: A correct learner response with relationship to task signals.

10. Incorrect: An incorrect learner response with relationship to task signals.

11. No response: Learner gives no response with relationship to task signals.

12. Asks: Learner asks a question about the task.

13. Statement: Learner gives a positive statement or gestures not related to the
task.

14. Inappropriate: Learner gives inappropriate management behavior.

15. Other: Other learner than one being observed gives responses as teacher
directs task signals.

The teachers were grouped into four categories:

Group 1: Outstanding teachers,

Group 2: Poor teachers,

Group 3: First-year teachers,

Group 4: Teacher aides.

The  sample  sizes  in  groups  1-4  were  62,  61,  57,  and  41,  respectively.  Because  of  the 
large  values  of  N  and  p  (N  =  221,  p  =  15),  the  data  are  not  given  here. 

The stepwise discriminant analysis was run several times with different threshold
$F$-to-enter values in order to select subsets with different sizes. A classification
analysis based on resubstitution was carried out with each of the resulting subsets of
variables. In Table 9.7, we compare the overall Wilks' $\Lambda$ and the apparent correct
classification rate.

According to the correct classification rate, we would choose to stop at five variables
because of the abrupt change from 5 to 4. On the other hand, the changes in
Wilks' $\Lambda$ are more gradual, and no clear stopping point is indicated. □
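
The informal stopping rule can be checked numerically from the percentages reported in Table 9.7. The short Python sketch below looks only at steps that remove exactly one variable and reports the step with the largest drop in percentage of correct classification; the variable names are hypothetical, and the choice of "largest single-step drop" as the criterion is an assumption made for illustration, not a formal rule.

```python
# (number of variables, percentage of correct classification) from Table 9.7
table_9_7 = [(15, 77.4), (10, 72.4), (9, 73.3), (8, 70.6),
             (7, 72.9), (6, 70.1), (5, 70.6), (4, 65.6)]

# Change in percent correct for each step that removes exactly one variable.
steps = [
    (p_from, p_to, round(pct_to - pct_from, 1))
    for (p_from, pct_from), (p_to, pct_to) in zip(table_9_7, table_9_7[1:])
    if p_from - p_to == 1
]

# The most negative change marks the most abrupt loss of accuracy.
largest_drop = min(steps, key=lambda s: s[2])
print(steps)
print(largest_drop)
```

The largest one-variable drop occurs in going from five variables to four, which agrees with the stopping point suggested in the example.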


Table 9.7. Stepwise Selection Statistics for the Teacher Data

Number of     Overall        Percentage of Correct
Variables     Wilks' Λ       Classification
   15           .132              77.4
   10           .159              72.4
    9           .170              73.3
    8           .182              70.6
    7           .195              72.9
    6           .211              70.1
    5           .231              70.6
    4           .256              65.6


9.7 NONPARAMETRIC PROCEDURES

We have previously discussed both parametric and nonparametric procedures.
Welch's optimal rule in (9.7) and (9.11) is parametric, whereas Fisher's linear
classification rule for two groups as given in (9.5) and (9.6) is essentially nonparametric,
since no distributional assumptions were involved in its derivation. However,
Fisher's procedure also turns out to be equivalent to the optimal normal-based
approach in (9.8). Nonparametric procedures for estimating error rate include the
resubstitution and holdout methods. In the next three sections we discuss three
additional nonparametric classification procedures.


9.7.1  Multinomial  Data 

We now consider data in which an observation vector consists of responses on each
of several categorical variables. The various combinations of categories constitute
the possible outcomes of a multinomial random variable. For example, consider the
following four categorical variables: gender (male or female), political party (Republican,
Democrat, other), size of city of residence (under 10,000, between 10,000 and
100,000, over 100,000), and education (less than high school graduation, high school
graduate, college graduate, advanced degree). With the categories of each variable
numbered in the order listed, an observation vector might be (2, 1, 3, 4), that is, a
female Republican who lives in a city of over 100,000 and holds an advanced
degree. The total number of possible outcomes in this multinomial distribution is
the product of the number of states of the individual variables: 2 × 3 × 3 × 4 = 72.
We will use this example to illustrate classification procedures for multinomial data.
Suppose we are attempting to predict whether or not a person will vote. Then there
are two groups, $G_1$ and $G_2$, and we assign a person to one of the groups after observing
which of the 72 possible outcomes he or she gives.

Welch's (1939) optimum rule given in (9.7) can be written as: Assign $\mathbf{y}$ to $G_1$ if

$$\frac{f(\mathbf{y} \mid G_1)}{f(\mathbf{y} \mid G_2)} > \frac{p_2}{p_1} \tag{9.17}$$

and to $G_2$ otherwise. In our categorical example, $f(\mathbf{y} \mid G_1)$ is represented by $q_{1i}$,
$i = 1, 2, \ldots, 72$, and $f(\mathbf{y} \mid G_2)$ becomes $q_{2i}$, $i = 1, 2, \ldots, 72$, where $q_{1i}$ is the
probability that a person in group 1 will give the $i$th outcome, with an analogous
definition for $q_{2i}$. In terms of these multinomial probabilities, the classification rule
in (9.17) becomes: If a person gives the $i$th outcome, assign him or her to $G_1$ if

$$\frac{q_{1i}}{q_{2i}} > \frac{p_2}{p_1} \tag{9.18}$$

and to $G_2$ otherwise. If the probabilities $q_{1i}$ and $q_{2i}$ were known, it would be easy
to check (9.18) for each $i$ and partition the 72 possible outcomes into two subsets,
those for which the person would be assigned to $G_1$ and those corresponding to $G_2$.

The values of $q_{1i}$ and $q_{2i}$ are usually unknown and must be estimated from a
sample. Let $n_{1i}$ and $n_{2i}$ be the numbers of persons in groups 1 and 2 who give the
$i$th outcome, $i = 1, 2, \ldots, 72$. Then we estimate $q_{1i}$ and $q_{2i}$ by

$$\hat{q}_{1i} = \frac{n_{1i}}{N_1} \quad \text{and} \quad \hat{q}_{2i} = \frac{n_{2i}}{N_2}, \quad i = 1, 2, \ldots, 72, \tag{9.19}$$

where $N_1 = \sum_i n_{1i}$ and $N_2 = \sum_i n_{2i}$. However, a large sample size would be
required for stable estimates; in any given example, some of the $n$'s may be zero.
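
The estimates in (9.19) and the rule in (9.18) can be sketched in Python as follows. To keep the example short, it uses four hypothetical outcomes rather than 72, with made-up counts; the helper name `multinomial_rule` is also hypothetical. The inequality is rewritten as $p_1 \hat{q}_{1i} > p_2 \hat{q}_{2i}$ so that zero counts do not cause a division by zero; in this sketch, ties and empty cells default to $G_2$.

```python
import numpy as np

def multinomial_rule(n1, n2, p1=0.5, p2=0.5):
    """Return the group (1 or 2) assigned to each outcome under rule (9.18)."""
    n1 = np.asarray(n1, dtype=float)
    n2 = np.asarray(n2, dtype=float)
    q1 = n1 / n1.sum()        # q-hat_{1i} = n_{1i} / N_1
    q2 = n2 / n2.sum()        # q-hat_{2i} = n_{2i} / N_2
    # q1/q2 > p2/p1 is equivalent to p1*q1 > p2*q2 when q2 > 0.
    return np.where(p1 * q1 > p2 * q2, 1, 2)

# Hypothetical counts for four outcomes in each group.
n1 = [30, 10, 5, 5]
n2 = [10, 10, 20, 10]

equal_priors = multinomial_rule(n1, n2)
unequal_priors = multinomial_rule(n1, n2, p1=0.7, p2=0.3)
print(equal_priors, unequal_priors)
```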

Multinomial  data  can  also  be  classified  by  ordinary  linear  classification  functions. 
We  must  distinguish  between  ordered  and  unordered  categories.  If  all  the  variables 
have  ordered  categories,  the  data  can  be  submitted  directly  to  an  ordinary  classifi¬ 
cation  program.  In  the  preceding  example,  city  size  and  education  are  variables  of 
this  type.  It  is  customary  to  assign  ordered  categories  ranked  values  such  as  1,2,  3, 
4.  It  has  been  shown  that  linear  classification  functions  perform  reasonably  well  on 
(ordered)  discrete  data  of  this  type  [see  Lachenbruch  (1975,  p.  45),  Titterington  et  al. 
(1981),  and  Gilbert  (1968)]. 

Unordered categorical variables cannot be handled this same way. Thus the political
party variable in the preceding example should not be coded 1, 2, 3 and entered
into the computation of the classification functions. However, an unordered categorical
variable with $k$ categories can be replaced by $k - 1$ dummy variables (see
Sections 6.1.8 and 11.6.2) for use with linear classification functions. For example,
the political preference variable with three categories can be converted to two
dummy variables as follows:

$$y_1 = \begin{cases} 1 & \text{if Republican}, \\ 0 & \text{otherwise}, \end{cases} \qquad y_2 = \begin{cases} 1 & \text{if Democrat}, \\ 0 & \text{otherwise}. \end{cases}$$

Thus the $(y_1, y_2)$ pair takes the value (1, 0) for a Republican, (0, 1) for a Democrat,
and (0, 0) for other. Many software programs will create dummy variables automatically.
Note that if a subset selection program is used, the dummy variables for a given
categorical variable must be kept together; that is, they must all be included in the
chosen subset or all excluded, because all are necessary to describe the categorical
variable.
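
The coding just described can be written as a one-line Python function. The name `dummy_code` and the convention that the last listed category is the all-zero reference level are assumptions made for this sketch.

```python
def dummy_code(value, categories):
    """Return k - 1 indicator values; the last category is coded as all zeros."""
    return tuple(int(value == c) for c in categories[:-1])

parties = ["Republican", "Democrat", "other"]
coded = {party: dummy_code(party, parties) for party in parties}
print(coded)
```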

In  some  cases,  such  as  in  medical  data  collection,  there  is  a  mixture  of  continuous 
and  categorical  variables.  Various  approaches  to  classification  with  such  data  have 
been  discussed  by  Krzanowski  (1975,  1976,  1977,  1979,  1980),  Lachenbruch  and 
Goldstein  (1979),  Tu  and  Han  (1982),  and  Bayne  et  al.  (1983).  See  Rencher  (1998, 
Section  6.8)  for  a  discussion  of  logistic  and  probit  classification,  which  are  useful  for 
certain  types  of  continuous  and  discrete  data  that  are  not  normally  distributed. 


9.7.2  Classification  Based  on  Density  Estimators 

In (9.8), (9.12), and (9.14) we have linear and quadratic classification rules based
on the multivariate normal density and prior probabilities. These normal-based rules
arose from Welch's optimal rule that assigns $\mathbf{y}$ to the group for which $p_i f(\mathbf{y} \mid G_i)$ is
maximum. If the form of $f(\mathbf{y} \mid G_i)$ is nonnormal and unknown, the density can be
estimated directly from the data. The approach we describe is known as the kernel
estimator.

We first describe the kernel method for a univariate continuous random variable
$y$. Suppose $y$ has density $f(y)$, which we wish to estimate using a sample
$y_1, y_2, \ldots, y_n$. A simple estimate of $f(y_0)$ for an arbitrary point $y_0$ can be based
on the proportion of points in the interval $(y_0 - h, y_0 + h)$. If the number of points
in the interval is denoted by $N(y_0)$, then the proportion $N(y_0)/n$ is an estimate of
$P(y_0 - h < y < y_0 + h)$, which is approximately equal to $2hf(y_0)$. Thus we estimate
$f(y_0)$ by

$$\hat{f}(y_0) = \frac{N(y_0)}{2hn}. \tag{9.20}$$

We can express $\hat{f}(y_0)$ as a function of all $y_i$ in the sample by defining

$$K(u) = \begin{cases} \tfrac{1}{2} & \text{for } |u| \le 1, \\ 0 & \text{for } |u| > 1, \end{cases} \tag{9.21}$$

so that $N(y_0) = 2 \sum_{i=1}^{n} K[(y_0 - y_i)/h]$, and (9.20) becomes

$$\hat{f}(y_0) = \frac{1}{hn} \sum_{i=1}^{n} K\!\left(\frac{y_0 - y_i}{h}\right). \tag{9.22}$$

The function $K(u)$ is called the kernel. In (9.22), $K[(y_0 - y_i)/h]$ is $\tfrac{1}{2}$ for any point
$y_i$ in the interval $(y_0 - h, y_0 + h)$ and is zero for points outside the interval. Points
in the interval add $1/(2hn)$ to the density and points outside the interval contribute
nothing.
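
A small Python sketch of the rectangular-kernel estimate in (9.22) is given below. The sample values and the choice $h = 1$ are hypothetical, and points lying exactly at distance $h$ from $y_0$ are counted as inside the interval, which is an assumption of this sketch.

```python
import numpy as np

def rectangular_kernel(u):
    """Kernel (9.21): 1/2 inside [-1, 1], 0 outside (boundary counted as inside)."""
    return np.where(np.abs(u) <= 1, 0.5, 0.0)

def kernel_density(y0, sample, h, kernel=rectangular_kernel):
    """Estimate (9.22) of the density at the point y0."""
    sample = np.asarray(sample, dtype=float)
    return kernel((y0 - sample) / h).sum() / (h * len(sample))

# Hypothetical sample: three of the four points lie within h = 1 of y0 = 1.6,
# so the estimate equals N(y0) / (2 h n) = 3 / 8, in agreement with (9.20).
sample = [1.0, 1.5, 2.0, 4.0]
estimate = kernel_density(1.6, sample, h=1.0)
print(estimate)
```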

Kernel  estimators  were  first  proposed  by  Rosenblatt  (1956)  and  Parzen  (1962). 
A  good  review  of  nonparametric  density  estimation  including  kernel  estimators  has 
been  given  by  Silverman  (1986),  who  noted  that  classification  analysis  provided  the 
initial  motivation  for  the  development  of  density  estimation. 

The kernel defined by (9.21) is rectangular, and the graph of $\hat{f}(y_0)$ plotted as a
function of $y_0$ will be a step function, since there will be a jump (or drop) whenever
$y_0$ is a distance $h$ from one of the $y_i$'s. (A moving average has a similar property.) To
obtain a smooth estimator of $f(y)$, we must choose a smooth kernel. Two possibilities
are

$$K(u) = \frac{1}{\pi} \frac{\sin^2 u}{u^2}, \tag{9.23}$$

$$K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}, \tag{9.24}$$

which have the property that all $n$ sample points $y_1, y_2, \ldots, y_n$ contribute to $\hat{f}(y_0)$,
with the closest points weighted more heavily than the more distant points. Even though
$K(u)$ in (9.24) has the form of the normal distribution, this does not imply any
assumption about the density $f(y)$. We have used the normal density function
because it is symmetric and unimodal. Other density functions could be used as
kernels.
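
The two smooth kernels can be substituted into the estimate (9.22) without other changes, as the following Python sketch suggests. The sample, the value $h = 0.5$, the grid, and the function names are hypothetical. The crude Riemann sum at the end is only a sanity check that the normal-kernel estimate integrates to approximately 1.

```python
import numpy as np

def normal_kernel(u):
    """Kernel (9.24)."""
    return np.exp(-np.square(u) / 2.0) / np.sqrt(2.0 * np.pi)

def sinc_squared_kernel(u):
    """Kernel (9.23); np.sinc(x) is sin(pi x)/(pi x), so its argument is rescaled."""
    return np.sinc(np.asarray(u) / np.pi) ** 2 / np.pi

def kernel_density(y0, sample, h, kernel):
    """Estimate (9.22) at the point y0 with the chosen kernel."""
    sample = np.asarray(sample, dtype=float)
    return np.sum(kernel((y0 - sample) / h)) / (h * len(sample))

sample = [1.0, 1.5, 2.0, 4.0]        # hypothetical observations
h = 0.5
step = 0.01
grid = np.arange(-10.0, 15.0, step)
f_normal = np.array([kernel_density(g, sample, h, normal_kernel) for g in grid])
area = f_normal.sum() * step         # Riemann-sum approximation of the integral
print(area)
```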

Cacoullos (1966) provided kernel estimates for multivariate density functions; see
also Scott (1992). If $\mathbf{y}_0' = (y_{01}, y_{02}, \ldots, y_{0p})$ is an arbitrary point whose density we
wish to estimate, then the extension of (9.22) is

$$\hat{f}(\mathbf{y}_0) = \frac{1}{n h_1 h_2 \cdots h_p} \sum_{i=1}^{n} K\!\left(\frac{y_{01} - y_{i1}}{h_1}, \ldots, \frac{y_{0p} - y_{ip}}{h_p}\right). \tag{9.25}$$

An estimate $\hat{f}(\mathbf{y}_0)$ based on a multivariate normal kernel is given by

$$\hat{f}(\mathbf{y}_0) = \frac{1}{n h^p |\mathbf{S}_{pl}|^{1/2}} \sum_{i=1}^{n} e^{-(\mathbf{y}_0 - \mathbf{y}_i)' \mathbf{S}_{pl}^{-1} (\mathbf{y}_0 - \mathbf{y}_i)/2h^2}, \tag{9.26}$$

where $h_1 = h_2 = \cdots = h_p = h$ and $\mathbf{S}_{pl}$ is the pooled covariance matrix from the
$k$ groups in the sample. The covariance matrix $\mathbf{S}_{pl}$ could be replaced by other forms.
Two examples are (1) $\mathbf{S}_i$ for the $i$th group and (2) a diagonal matrix.

The choice of the smoothing parameter $h$ is critical in a kernel density estimator.
The size of $h$ determines how much each $\mathbf{y}_i$ contributes to $\hat{f}(\mathbf{y}_0)$. If $h$ is too small,
$\hat{f}(\mathbf{y}_0)$ has a peak at each $\mathbf{y}_i$, and if $h$ is too large, $\hat{f}(\mathbf{y}_0)$ is almost uniform (overly
smoothed). Therefore, the value chosen for $h$ must depend on the sample size $n$ to
avoid too much or too little smoothing; the larger the sample size, the smaller $h$
should be. In practice, we could try several values of $h$ and check the resulting error
rates from the classification analysis.

To use the kernel method of density estimation in classification, we can apply it to
each group to obtain $\hat{f}(\mathbf{y}_0|G_1), \hat{f}(\mathbf{y}_0|G_2), \ldots, \hat{f}(\mathbf{y}_0|G_k)$, where $\mathbf{y}_0$ is the vector of
measurements for an individual of unknown group membership. The classification
rule then becomes: Assign $\mathbf{y}_0$ to the group $G_i$ for which

$$p_i \hat{f}(\mathbf{y}_0|G_i) \text{ is maximum.} \tag{9.27}$$
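
The estimate in (9.26) and the rule in (9.27) can be sketched as follows. This is an illustration only: the function names are hypothetical, each group is assumed to be stored as an $n_i \times p$ NumPy array with $p \ge 2$, and the kernel is used exactly as printed in (9.26), without a $(2\pi)^{p/2}$ constant (which would not affect the comparison in (9.27), since it is the same for every group).

```python
import numpy as np

def pooled_cov(groups):
    """Pooled covariance matrix S_pl from a list of (n_i x p) arrays."""
    n_total = sum(len(g) for g in groups)
    ss = sum((len(g) - 1) * np.atleast_2d(np.cov(g, rowvar=False)) for g in groups)
    return ss / (n_total - len(groups))

def normal_kernel_density(y0, group, S, h):
    """Estimate (9.26) for one group, using covariance matrix S and smoothing h."""
    n, p = group.shape
    d = group - y0
    # Quadratic form (y0 - y_i)' S^{-1} (y0 - y_i) for every row i.
    q = np.einsum("ij,jk,ik->i", d, np.linalg.inv(S), d)
    const = n * h ** p * np.sqrt(np.linalg.det(S))
    return float(np.exp(-q / (2.0 * h ** 2)).sum() / const)

def classify(y0, groups, h, priors=None):
    """Rule (9.27): 0-based index of the group maximizing p_i * f_hat(y0 | G_i)."""
    S = pooled_cov(groups)
    if priors is None:
        priors = np.full(len(groups), 1.0 / len(groups))
    scores = [p * normal_kernel_density(y0, g, S, h) for p, g in zip(priors, groups)]
    return int(np.argmax(scores))
```

Repeating `classify` over a grid of `h` values and tabulating the misclassifications is one way to carry out the trial-and-error choice of $h$ described above.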


Habbema, Hermans, and Van den Broek (1974) proposed a forward selection
method for classification based on density estimation. Wegman (1972) and Habbema,
Hermans, and Remme (1978) found that the size of the $h_i$'s is more important than
the shape of the kernel. The choice of $h$ was investigated by Pfeiffer (1985) in a
stepwise mode. Remme, Habbema, and Hermans (1980) compared linear, quadratic,
and kernel classification methods for two groups and reported that for multivariate
normal data with equal covariance matrices, the linear classifications were clearly
superior. For some cases with departures from these assumptions, the kernel methods
gave better results.


Example 9.7.2. We illustrate the density estimation method of classification for the
football data of Table 8.3. We use the multivariate normal kernel estimator in (9.26)
with $h = 2$ to obtain $\hat{f}(\mathbf{y}_0|G_i)$, $i = 1, 2, 3$, for the three groups. Using $p_1 = p_2 = p_3$,
the rule in (9.27) becomes: Assign $\mathbf{y}_0$ to the group for which $\hat{f}(\mathbf{y}_0|G_i)$ is
greatest. To obtain an apparent error rate, we follow this procedure for each of the 90
observations and obtain the classification results in Table 9.8.

Applying a holdout method in which the observation $\mathbf{y}_{ij}$ being classified is
excluded from computation of $\hat{f}(\mathbf{y}_{ij}|G_1)$, $\hat{f}(\mathbf{y}_{ij}|G_2)$, and $\hat{f}(\mathbf{y}_{ij}|G_3)$, we obtain the
classification results in Table 9.9. As expected, the holdout error rate has increased
somewhat from the apparent error rate in Table 9.8. □

9.7.3  Nearest  Neighbor  Classification  Rule 

The earliest nonparametric classification method was the nearest neighbor rule of
Fix and Hodges (1951), also known as the $k$ nearest neighbor rule. The procedure
is conceptually simple. We compute the distance from an observation $\mathbf{y}_i$ to all other
points $\mathbf{y}_j$ using the distance function

$$(\mathbf{y}_i - \mathbf{y}_j)'\mathbf{S}_{\rm pl}^{-1}(\mathbf{y}_i - \mathbf{y}_j).$$


Table 9.8. Classification Table for the Football Data of Table 8.3 Using the Density Estimation Method of Classification with Multivariate Normal Kernel

Actual    Number of        Predicted Group
Group     Observations     1      2      3
1         30               25     1      4
2         30               0      12     18
3         30               0      3      27

Apparent correct classification rate = (25 + 12 + 27)/90 = .711
Apparent error rate = 1 - .711 = .289

Table 9.9. Classification Table for the Football Data of Table 8.3 Using the Holdout Method Based on Density Estimation

Actual    Number of        Predicted Group
Group     Observations     1      2      3
1         30               24     1      5
2         30               0      10     20
3         30               1      3      26

Correct classification rate = (24 + 10 + 26)/90 = .667
Error rate = 1 - .667 = .333

To classify $\mathbf{y}_i$ into one of two groups, the $k$ points nearest to $\mathbf{y}_i$ are examined, and
if the majority of the $k$ points belong to $G_1$, assign $\mathbf{y}_i$ to $G_1$; otherwise assign $\mathbf{y}_i$ to
$G_2$. If we denote the number of points from $G_1$ as $k_1$, with the remaining $k_2$ points
from $G_2$, where $k = k_1 + k_2$, then the rule can be expressed as: Assign $\mathbf{y}_i$ to $G_1$ if

$$k_1 > k_2 \tag{9.28}$$


and to $G_2$ otherwise. If the sample sizes $n_1$ and $n_2$ differ, we may wish to use
proportions in place of counts: Assign $\mathbf{y}_i$ to $G_1$ if

$$\frac{k_1}{n_1} > \frac{k_2}{n_2}. \tag{9.29}$$

A further refinement can be made by taking into account prior probabilities: Assign
$\mathbf{y}_i$ to $G_1$ if

$$\frac{k_1/n_1}{k_2/n_2} > \frac{p_2}{p_1}. \tag{9.30}$$

These rules are easily extended to more than two groups. For example, (9.29)
becomes: Assign the observation to the group that has the highest proportion $k_i/n_i$,
where $k_i$ is the number of observations from $G_i$ among the $k$ nearest neighbors of
the observation in question.

A decision must be made as to the value of $k$. Loftsgaarden and Quesenberry
(1965) suggest choosing $k$ near $\sqrt{n_i}$ for a typical $n_i$. In practice, one could try several
values of $k$ and use the one with the best error rate.
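
The rules above can be sketched in a few lines. The code below is an illustration under stated assumptions, not a reference implementation: the function name and arguments are hypothetical, the inverse of the pooled covariance matrix is supplied by the caller, and the multigroup score $p_i k_i / n_i$ is the natural extension of (9.30), which for two groups is equivalent to $p_1 k_1/n_1 > p_2 k_2/n_2$. Ties are left unclassified.

```python
import numpy as np

def knn_classify(y, X, labels, S_inv, k, priors=None):
    """Assign y to the group with the largest p_i * k_i / n_i; return None on a tie."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    d = X - np.asarray(y, dtype=float)
    # Distance function (y - y_j)' S_pl^{-1} (y - y_j) for every row j of X.
    dist = np.einsum("ij,jk,ik->i", d, S_inv, d)
    nearest = labels[np.argsort(dist, kind="stable")[:k]]
    groups, n_i = np.unique(labels, return_counts=True)
    k_i = np.array([(nearest == g).sum() for g in groups])
    if priors is None:
        priors = np.ones(len(groups))
    score = np.asarray(priors, dtype=float) * k_i / n_i
    best = np.flatnonzero(np.isclose(score, score.max()))
    return groups[best[0]] if best.size == 1 else None
```

When the point being classified is itself a sample observation, the caller should remove it from `X` and `labels` first, so that distances are computed to all other points only.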

Reviews  and  extensions  of  the  nearest  neighbor  method  have  been  given  by  Hart 
(1968),  Gates  (1972),  Hand  and  Batchelor  (1978),  Chidananda  Gowda  and  Krishna 
(1979),  Rogers  and  Wagner  (1978),  and  Brown  and  Koplowitz  (1979). 


Example 9.7.3. We use the football data of Table 8.3 to illustrate the $k$ nearest neighbor
method of estimating error rate, with $k = 5$. Since $n_1 = n_2 = n_3 = 30$ and the
$p_i$'s are also assumed to be equal, we simply examine the five points closest to a
point $\mathbf{y}$ and classify $\mathbf{y}$ into the group that has the most points among the five points.
If there is a tie, we do not classify the point. For example, if the numbers from $G_1$,
$G_2$, and $G_3$ were 1, 2, and 2, respectively, then we do not assign $\mathbf{y}$ to either $G_2$ or $G_3$.

For each point $\mathbf{y}_{ij}$, $i = 1, 2, 3$; $j = 1, 2, \ldots, 30$, we find the five nearest neighbors
and classify the point accordingly. Table 9.10 gives the classification results. As can
be seen, there were 3 observations in group 1 that were not classified because of ties,
1 in group 2, and 3 in group 3. This left a total of 83 observations classified. □

Table 9.10. Classification Table for the Football Data of Table 8.3 Using the k Nearest Neighbor Method with k = 5

Actual    Number of        Predicted Group
Group     Observations     1      2      3
1         30               26     0      1
2         30               1      19     9
3         30               1      4      22

Correct classification rate = (26 + 19 + 22)/83 = .807
Error rate = 1 - .807 = .193


PROBLEMS 

9.1  Show that if $z_{1i} = \mathbf{a}'\mathbf{y}_{1i}$, $i = 1, 2, \ldots, n_1$, and $z_{2i} = \mathbf{a}'\mathbf{y}_{2i}$, $i = 1, 2, \ldots, n_2$,
where $z$ is the discriminant function defined in (9.1), then $\bar{z}_1 - \bar{z}_2 = (\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)'\mathbf{S}_{\rm pl}^{-1}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)$ as in (9.3).

9.2  With $z = \mathbf{a}'\mathbf{y}$ as in (9.1) and $\bar{z}_1 = \mathbf{a}'\bar{\mathbf{y}}_1$, $\bar{z}_2 = \mathbf{a}'\bar{\mathbf{y}}_2$, show that $\frac{1}{2}(\bar{z}_1 + \bar{z}_2) = \frac{1}{2}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)'\mathbf{S}_{\rm pl}^{-1}(\bar{\mathbf{y}}_1 + \bar{\mathbf{y}}_2)$ as in (9.4).

9.3  Obtain the normal-based classification rule in (9.8).

9.4  Derive the linear classification rule in (9.12).

9.5  Derive the quadratic classification function in (9.14).

9.6  Do a classification analysis on the beetle data in Table 5.5 as follows:

(a)  Find the classification function $z = (\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)'\mathbf{S}_{\rm pl}^{-1}\mathbf{y}$ and the cutoff point
$\frac{1}{2}(\bar{z}_1 + \bar{z}_2)$.

(b)  Find the classification table using the linear classification function in
part (a).

(c)  Find the classification table using the nearest neighbor method.

9.7  Do a classification analysis on the dystrophy data of Table 5.7 as follows:

(a)  Find the classification function $z = (\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)'\mathbf{S}_{\rm pl}^{-1}\mathbf{y}$ and the cutoff point
$\frac{1}{2}(\bar{z}_1 + \bar{z}_2)$.

(b)  Find the classification table using the linear classification function in
part (a).

(c)  Repeat part (b) using $p_1$ and $p_2$ proportional to sample sizes.

9.8  Do a classification analysis on the cyclical data of Table 5.8 as follows:

(a)  Find the classification function $z = (\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)'\mathbf{S}_{\rm pl}^{-1}\mathbf{y}$ and the cutoff point
$\frac{1}{2}(\bar{z}_1 + \bar{z}_2)$.

(b)  Find the classification table using the linear classification function in
part (a).

(c)  Find the classification table using the holdout method.

(d)  Find the classification table using a kernel density estimator method.

9.9  Using the engineer data of Table 5.6, carry out a classification analysis as follows:

(a)  Find the classification table using the linear classification function.

(b)  Carry out a stepwise discriminant selection of variables (see Problem 8.14).

(c)  Find the classification table for the variables selected in part (b).

9.10  Do a classification analysis on the fish data in Table 6.17 as follows. Assume
$p_1 = p_2 = p_3$.

(a)  Find the linear classification functions.

(b)  Find the classification table using the linear classification functions in
part (a) (assuming $\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 = \boldsymbol{\Sigma}_3$).

(c)  Find the classification table using quadratic classification functions (assuming
population covariance matrices are not equal).

(d)  Find the classification table using linear classification functions and the
holdout method.

(e)  Find the classification table using a nearest neighbor method.

9.11  Do a classification analysis on the rootstock data of Table 6.2 as follows:

(a)  Find the linear classification functions.

(b)  Find the classification table using the linear classification functions in
part (a) (assuming $\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 = \boldsymbol{\Sigma}_3$).

(c)  Find the classification table using quadratic classification functions (assuming
population covariance matrices are not equal).

(d)  Find the classification table using the nearest neighbor method.

(e)  Find the classification table using a kernel density estimator method.


CHAPTER  10 


Multivariate  Regression 


10.1  INTRODUCTION 

In  this  chapter,  we  consider  the  linear  relationship  between  one  or  more  y’s  (the 
dependent  or  response  variables)  and  one  or  more  x’s  (the  independent  or  predictor 
variables).  We  will  use  a  linear  model  to  relate  the  y’s  to  the  x’s  and  will  be  concerned 
with  estimation  and  testing  of  the  parameters  in  the  model.  One  aspect  of  interest  will 
be  choosing  which  variables  to  include  in  the  model  if  this  is  not  already  known. 

We  can  distinguish  three  cases  according  to  the  number  of  variables: 

1.  Simple  linear  regression:  one  y  and  one  x.  For  example,  suppose  we  wish  to 
predict  college  grade  point  average  (GPA)  based  on  an  applicant’s  high  school 
GPA. 

2.  Multiple  linear  regression:  one  y  and  several  x’s.  We  could  attempt  to  improve 
our  prediction  of  college  GPA  by  using  more  than  one  independent  variable, 
for  example,  high  school  GPA,  standardized  test  scores  (such  as  ACT  or  SAT), 
or  rating  of  high  school. 

3.  Multivariate multiple linear regression: several y's and several x's. In the
preceding illustration, we may wish to predict several y's (such as number of
years of college the person will complete or GPA in the sciences, arts, and
humanities). As another example, suppose the Air Force wishes to predict several
measures of pilot efficiency. These response variables could be regressed
against independent variables (such as math and science skills, reaction time,
eyesight acuity, and manual dexterity).

To further distinguish case 2 from case 3, we could designate case 2 as univariate
multiple regression because there is only one y. Thus in case 3, multivariate indicates
that there are several y's and multiple implies several x's. The term multivariate
regression usually refers to case 3.

There are two basic types of independent variables, fixed and random. In the
preceding illustrations, all x's are random variables and are therefore not under the
control of the researcher. A person is chosen at random, and all the y's and x's are
measured, or observed, for that person. In some experimental situations, the x's are
fixed, that is, under the control of the experimenter. For example, a researcher may
wish to relate yield per acre and nutritional value to level of application of various
chemical fertilizers. The experimenter can choose the amount of chemicals to be
applied and then observe the changes in the yield and nutritional responses.

In  order  to  provide  a  solid  base  for  multivariate  multiple  regression,  we  review 
several  aspects  of  multiple  regression  with  fixed  x’s  in  Section  10.2.  The  random-x 
case  for  multiple  regression  is  discussed  briefly  in  Section  10.3. 


10.2  MULTIPLE  REGRESSION:  FIXED  x’s 
10.2.1  Model  for  Fixed  x’s 

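In the fixed-x regression model, we express each y in a sample of $n$ observations as
a linear function of the x's plus a random error, $\varepsilon$:

$$\begin{aligned}
y_1 &= \beta_0 + \beta_1 x_{11} + \beta_2 x_{12} + \cdots + \beta_q x_{1q} + \varepsilon_1 \\
y_2 &= \beta_0 + \beta_1 x_{21} + \beta_2 x_{22} + \cdots + \beta_q x_{2q} + \varepsilon_2 \\
&\;\;\vdots \\
y_n &= \beta_0 + \beta_1 x_{n1} + \beta_2 x_{n2} + \cdots + \beta_q x_{nq} + \varepsilon_n.
\end{aligned} \tag{10.1}$$

The number of x's is denoted by $q$. The $\beta$'s in (10.1) are called regression
coefficients. Additional assumptions that accompany the equations of the model are as
follows:

1.  $E(\varepsilon_i) = 0$ for all $i = 1, 2, \ldots, n$.

2.  $\mathrm{var}(\varepsilon_i) = \sigma^2$ for all $i = 1, 2, \ldots, n$.

3.  $\mathrm{cov}(\varepsilon_i, \varepsilon_j) = 0$ for all $i \ne j$.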

Assumption 1 states that the model is linear and that no additional terms are needed
to predict $y$; all remaining variation in $y$ is purely random and unpredictable. Thus
if $E(\varepsilon_i) = 0$ and the x's are fixed, then $E(y_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_q x_{iq}$,
and the mean of $y$ is expressible in terms of these $q$ x's with no others needed. In
assumption 2, the variance of each $\varepsilon_i$ is the same, which also implies that $\mathrm{var}(y_i) = \sigma^2$,
since the x's are fixed. Assumption 3 imposes the condition that the error terms
be uncorrelated, from which it follows that the y's are also uncorrelated, that is,
$\mathrm{cov}(y_i, y_j) = 0$.

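Thus the three assumptions can be restated in terms of $y$ as follows:

1.  $E(y_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_q x_{iq}$, $i = 1, 2, \ldots, n$.

2.  $\mathrm{var}(y_i) = \sigma^2$, $i = 1, 2, \ldots, n$.

3.  $\mathrm{cov}(y_i, y_j) = 0$, for all $i \ne j$.

Using matrix notation, the models for the $n$ observations in (10.1) can be written
much more concisely in the form

$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} =
\begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1q} \\ 1 & x_{21} & x_{22} & \cdots & x_{2q} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nq} \end{pmatrix}
\begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_q \end{pmatrix} +
\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix} \tag{10.2}$$

or

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}. \tag{10.3}$$

With this notation, the preceding three assumptions become

1.  $E(\boldsymbol{\varepsilon}) = \mathbf{0}$,

2.  $\mathrm{cov}(\boldsymbol{\varepsilon}) = \sigma^2\mathbf{I}$,

which can be rewritten in terms of $\mathbf{y}$ as

1.  $E(\mathbf{y}) = \mathbf{X}\boldsymbol{\beta}$,

2.  $\mathrm{cov}(\mathbf{y}) = \sigma^2\mathbf{I}$.

Note that the second assumption in matrix form incorporates both the second and
third assumptions in univariate form; that is, $\mathrm{cov}(\mathbf{y}) = \sigma^2\mathbf{I}$ implies $\mathrm{var}(y_i) = \sigma^2$ and
$\mathrm{cov}(y_i, y_j) = 0$.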

For  estimation  and  testing  purposes,  we  need  to  have  n  >  q  +  1 .  Therefore,  the 
matrix  expression  (10.3)  has  the  following  typical  pattern: 


10.2.2  Least  Squares  Estimation  in  the  Fixed-x  Model 

If the first assumption holds, we have $E(y_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_q x_{iq}$.
We seek to estimate the $\beta$'s and thereby estimate $E(y_i)$. If the estimates are denoted
by $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_q$, then $\hat{E}(y_i) = \hat\beta_0 + \hat\beta_1 x_{i1} + \hat\beta_2 x_{i2} + \cdots + \hat\beta_q x_{iq}$. However, $\hat{E}(y_i)$
is usually designated $\hat{y}_i$. Thus $\hat{y}_i$ estimates $E(y_i)$, not $y_i$. We now consider the least
squares estimates of the $\beta$'s.

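The least squares estimates of $\beta_0, \beta_1, \ldots, \beta_q$ minimize the sum of squares of
deviations of the $n$ observed y's from their "modeled" values, that is, from their
values $\hat{y}_i$ predicted by the model. Thus we seek $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_q$ that minimize

$$\begin{aligned}
\mathrm{SSE} &= \sum_{i=1}^{n} \hat\varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \\
&= \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \hat\beta_2 x_{i2} - \cdots - \hat\beta_q x_{iq})^2.
\end{aligned} \tag{10.4}$$

The value of $\hat{\boldsymbol{\beta}} = (\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_q)'$ that minimizes SSE in (10.4) is given by

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}. \tag{10.5}$$

In (10.5), we assume that $\mathbf{X}'\mathbf{X}$ is nonsingular. This will ordinarily hold if $n > q + 1$
and no $x_j$ is a linear combination of other x's.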
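
As a brief numerical illustration of (10.5), the sketch below builds $\mathbf{X}$ with a leading column of 1's and solves the normal equations $\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{y}$. The function name is hypothetical and NumPy is assumed; solving the linear system is used here in place of forming the inverse explicitly, which gives the same estimate when $\mathbf{X}'\mathbf{X}$ is nonsingular.

```python
import numpy as np

def least_squares(x, y):
    """Return X (with intercept column) and beta_hat = (X'X)^{-1} X'y as in (10.5)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y)), x])   # n x (q + 1) matrix of (10.2)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    return X, beta_hat

# Usage: five observations on q = 2 fixed x's, with y generated without error.
x = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [2, 3]], dtype=float)
y = 1.0 + 2.0 * x[:, 0] - 3.0 * x[:, 1]
X, beta_hat = least_squares(x, y)
```

Because the y's in this usage example lie exactly on a plane, the estimate recovers the coefficients used to generate them.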

In expression (10.5), we see a characteristic pattern similar to that for $\hat\beta_1$ in simple
linear regression given in (3.11), $\hat\beta_1 = s_{xy}/s_x^2$. The product $\mathbf{X}'\mathbf{y}$ can be used to
compute the covariances of the x's with y. The product $\mathbf{X}'\mathbf{X}$ can be used to obtain
the covariance matrix of the x's, which includes the variances and covariances of the
x's [see the comment following (10.16) about variances and covariances involving
the fixed x's]. Since $\mathbf{X}'\mathbf{X}$ is typically not diagonal, each $\hat\beta_j$ depends on $s_{x_j y}$ and $s_{x_j}^2$
as well as the relationship of $x_j$ to the other x's.

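We now demonstrate algebraically that $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ in (10.5) minimizes
SSE (this can also be done readily with calculus). If we designate the $i$th row of $\mathbf{X}$
as $\mathbf{x}_i' = (1, x_{i1}, x_{i2}, \ldots, x_{iq})$, we can write (10.4) as

$$\mathrm{SSE} = \sum_{i=1}^{n} (y_i - \mathbf{x}_i'\hat{\boldsymbol{\beta}})^2.$$

The quantity $y_i - \mathbf{x}_i'\hat{\boldsymbol{\beta}}$ is the $i$th element of the vector $\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}$. Hence, by (2.33),

$$\mathrm{SSE} = (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}). \tag{10.6}$$

Let $\mathbf{b}$ be an alternative estimate that may lead to a smaller value of SSE than does $\hat{\boldsymbol{\beta}}$.
We add $\mathbf{X}(\hat{\boldsymbol{\beta}} - \mathbf{b})$ to see if this reduces SSE.

$$\mathrm{SSE} = [(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) + \mathbf{X}(\hat{\boldsymbol{\beta}} - \mathbf{b})]'[(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) + \mathbf{X}(\hat{\boldsymbol{\beta}} - \mathbf{b})].$$

We now expand this using the two terms $\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}$ and $\mathbf{X}(\hat{\boldsymbol{\beta}} - \mathbf{b})$ to obtain

$$\begin{aligned}
\mathrm{SSE} &= (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) + [\mathbf{X}(\hat{\boldsymbol{\beta}} - \mathbf{b})]'\mathbf{X}(\hat{\boldsymbol{\beta}} - \mathbf{b}) + 2[\mathbf{X}(\hat{\boldsymbol{\beta}} - \mathbf{b})]'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) \\
&= (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) + (\hat{\boldsymbol{\beta}} - \mathbf{b})'\mathbf{X}'\mathbf{X}(\hat{\boldsymbol{\beta}} - \mathbf{b}) + 2(\hat{\boldsymbol{\beta}} - \mathbf{b})'\mathbf{X}'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) \\
&= (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) + (\hat{\boldsymbol{\beta}} - \mathbf{b})'\mathbf{X}'\mathbf{X}(\hat{\boldsymbol{\beta}} - \mathbf{b}) + 2(\hat{\boldsymbol{\beta}} - \mathbf{b})'(\mathbf{X}'\mathbf{y} - \mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}}).
\end{aligned}$$

The third term vanishes if we substitute $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ into $\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}}$. The second
term is a positive definite quadratic form, and SSE is therefore minimized when
$\mathbf{b} = \hat{\boldsymbol{\beta}}$. Thus no value of $\mathbf{b}$ can reduce SSE from the value given by $\hat{\boldsymbol{\beta}}$. For a review
of properties of $\hat{\boldsymbol{\beta}}$ and an alternative derivation of $\hat{\boldsymbol{\beta}}$ based on the assumption that $\mathbf{y}$
is normally distributed, see Rencher (1998, Chapter 7; 2000, Chapter 7).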
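
The expansion above implies that, for any alternative $\mathbf{b}$, the sum of squares at $\mathbf{b}$ equals the minimum SSE plus the quadratic form $(\hat{\boldsymbol{\beta}} - \mathbf{b})'\mathbf{X}'\mathbf{X}(\hat{\boldsymbol{\beta}} - \mathbf{b})$. The following sketch checks this numerically on simulated data; the data, the seed, and the helper name `sse` are arbitrary choices made for illustration.

```python
import numpy as np

def sse(X, y, b):
    """Sum of squared deviations (y - Xb)'(y - Xb) for a coefficient vector b."""
    r = y - X @ b
    return float(r @ r)

rng = np.random.default_rng(0)
n, q = 20, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, q))])
y = rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # least squares estimate (10.5)
b = beta_hat + rng.normal(size=q + 1)          # an arbitrary alternative estimate
diff = beta_hat - b

lhs = sse(X, y, b)
# Minimum SSE plus the positive definite quadratic form; the cross term is zero.
rhs = sse(X, y, beta_hat) + float(diff @ X.T @ X @ diff)
```

The two quantities `lhs` and `rhs` agree up to rounding error, and `lhs` exceeds the SSE at `beta_hat` whenever `b` differs from `beta_hat`.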

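10.2.3  An Estimator for $\sigma^2$

It can be shown that

$$E(\mathrm{SSE}) = \sigma^2[n - (q + 1)] = \sigma^2(n - q - 1). \tag{10.7}$$

We can therefore obtain an unbiased estimator of $\sigma^2$ as

$$s^2 = \frac{\mathrm{SSE}}{n - q - 1} = \frac{1}{n - q - 1}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}). \tag{10.8}$$

We can also express SSE in the form

$$\mathrm{SSE} = \mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y}, \tag{10.9}$$

and we note that there are $n$ terms in $\mathbf{y}'\mathbf{y}$ and $q + 1$ terms in $\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y}$. The difference is
the denominator of $s^2$ in (10.8). Thus the degrees of freedom (denominator) for SSE
are reduced by $q + 1$.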
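
The two expressions for SSE in (10.8) and (10.9) can be compared directly. In this sketch the function name and the four-point data set are made up for illustration; for these data the fitted line is $\hat{y} = 1.3 + .8x$, SSE $= 1.8$, and $s^2 = 1.8/2 = .9$.

```python
import numpy as np

def s2_two_ways(X, y):
    """Return s^2 computed from residuals (10.8) and from y'y - beta_hat'X'y (10.9)."""
    n, k = X.shape                      # k = q + 1 parameters, including the intercept
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta_hat
    sse_resid = float(resid @ resid)
    sse_alt = float(y @ y - beta_hat @ X.T @ y)
    return sse_resid / (n - k), sse_alt / (n - k)

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 2.0, 4.0])
X = np.column_stack([np.ones(4), x])
s2_a, s2_b = s2_two_ways(X, y)
```

The form (10.9) is convenient for hand computation, but it subtracts two large numbers and can lose precision when $\mathbf{y}'\mathbf{y}$ is large relative to SSE; the residual form is usually the safer choice in software.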

The need for an adjustment of $q + 1$ to the degrees of freedom of SSE can be
illustrated with a simple random sample of a random variable $y$ from a population
with mean $\mu$ and variance $\sigma^2$. The sum of squares $\sum_i (y_i - \mu)^2$ has $n$ degrees of
freedom, whereas $\sum_i (y_i - \bar{y})^2$ has $n - 1$. It is intuitively clear that

$$E\left[\sum_{i=1}^{n} (y_i - \mu)^2\right] > E\left[\sum_{i=1}^{n} (y_i - \bar{y})^2\right]$$

because $\bar{y}$ fits the sample better than $\mu$, which is the mean of the population but not of
the sample. Thus (squared) deviations from $\bar{y}$ will tend to be smaller than deviations
from $\mu$. In fact, it is easily shown that

$$\begin{aligned}
\sum_{i=1}^{n} (y_i - \mu)^2 &= \sum_{i=1}^{n} (y_i - \bar{y} + \bar{y} - \mu)^2 \\
&= \sum_i (y_i - \bar{y})^2 + n(\bar{y} - \mu)^2,
\end{aligned} \tag{10.10}$$

whence

$$\sum_i (y_i - \bar{y})^2 = \sum_i (y_i - \mu)^2 - n(\bar{y} - \mu)^2.$$

Thus $\sum_i (y_i - \bar{y})^2$ is expressible as a sum of $n$ squares minus 1 square, which
corresponds to $n - 1$ degrees of freedom. More formally, we have

$$E\left[\sum_i (y_i - \bar{y})^2\right] = n\sigma^2 - n\,\frac{\sigma^2}{n} = (n - 1)\sigma^2.$$

10.2.4  The  Model  Corrected  for  Means 

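It is sometimes convenient to "center" the x's by subtracting their means, $\bar{x}_1 = \sum_{i=1}^{n} x_{i1}/n$,
$\bar{x}_2 = \sum_{i=1}^{n} x_{i2}/n$, and so on [$\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_q$ are the means of the
columns of $\mathbf{X}$ in (10.2)]. In terms of centered x's, the model for each $y_i$ in (10.1)
becomes

$$y_i = \alpha + \beta_1(x_{i1} - \bar{x}_1) + \beta_2(x_{i2} - \bar{x}_2) + \cdots + \beta_q(x_{iq} - \bar{x}_q) + \varepsilon_i, \tag{10.11}$$

where

$$\alpha = \beta_0 + \beta_1\bar{x}_1 + \beta_2\bar{x}_2 + \cdots + \beta_q\bar{x}_q. \tag{10.12}$$

To estimate

$$\boldsymbol{\beta}_1 = \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_q \end{pmatrix},$$

we use the centered x's in the matrix

$$\mathbf{X}_c = \begin{pmatrix} x_{11} - \bar{x}_1 & x_{12} - \bar{x}_2 & \cdots & x_{1q} - \bar{x}_q \\ x_{21} - \bar{x}_1 & x_{22} - \bar{x}_2 & \cdots & x_{2q} - \bar{x}_q \\ \vdots & \vdots & & \vdots \\ x_{n1} - \bar{x}_1 & x_{n2} - \bar{x}_2 & \cdots & x_{nq} - \bar{x}_q \end{pmatrix} = \begin{pmatrix} (\mathbf{x}_1 - \bar{\mathbf{x}})' \\ (\mathbf{x}_2 - \bar{\mathbf{x}})' \\ \vdots \\ (\mathbf{x}_n - \bar{\mathbf{x}})' \end{pmatrix}, \tag{10.13}$$

where $\mathbf{x}_i' = (x_{i1}, x_{i2}, \ldots, x_{iq})$ and $\bar{\mathbf{x}}' = (\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_q)$. Then by
(10.5), the least squares estimate of $\boldsymbol{\beta}_1$ is

$$\hat{\boldsymbol{\beta}}_1 = (\mathbf{X}_c'\mathbf{X}_c)^{-1}\mathbf{X}_c'\mathbf{y}. \tag{10.14}$$

If $E(y) = \beta_0 + \beta_1 x_1 + \cdots + \beta_q x_q$ is evaluated at $x_1 = \bar{x}_1, x_2 = \bar{x}_2, \ldots, x_q = \bar{x}_q$,
the result is the same as $\alpha$ in (10.12). Thus, we estimate $\alpha$ by $\bar{y}$:

$$\hat\alpha = \bar{y}.$$

In other words, if the origin of the x's is shifted to $\bar{\mathbf{x}} = (\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_q)'$, then the
intercept of the fitted model is $\bar{y}$. With $\hat\alpha = \bar{y}$, we obtain

$$\hat\beta_0 = \hat\alpha - \hat\beta_1\bar{x}_1 - \hat\beta_2\bar{x}_2 - \cdots - \hat\beta_q\bar{x}_q = \bar{y} - \hat{\boldsymbol{\beta}}_1'\bar{\mathbf{x}} \tag{10.15}$$

as an estimate of $\beta_0$ in (10.12). Together, the estimators $\hat\beta_0$ and $\hat{\boldsymbol{\beta}}_1$ in (10.15) and
(10.14) are the same as the usual least squares estimator $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ in (10.5).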

We can express $\hat{\boldsymbol{\beta}}_1$ in (10.14) in terms of sample variances and covariances. The
overall sample covariance matrix of $y$ and the x's is

$$\mathbf{S} = \begin{pmatrix} s_{yy} & s_{y1} & s_{y2} & \cdots & s_{yq} \\ s_{1y} & s_{11} & s_{12} & \cdots & s_{1q} \\ \vdots & \vdots & \vdots & & \vdots \\ s_{qy} & s_{q1} & s_{q2} & \cdots & s_{qq} \end{pmatrix} = \begin{pmatrix} s_{yy} & \mathbf{s}_{yx}' \\ \mathbf{s}_{yx} & \mathbf{S}_{xx} \end{pmatrix}, \tag{10.16}$$

where $s_{yy}$ is the variance of $y$, $s_{yj}$ is the covariance of $y$ and $x_j$, $s_{jj}$ is the variance
of $x_j$, $s_{jk}$ is the covariance of $x_j$ and $x_k$, and $\mathbf{s}_{yx}' = (s_{y1}, s_{y2}, \ldots, s_{yq})$. These
sample variances and covariances are mathematically equivalent to analogous formulas
(3.23) and (3.25) for random variables, where the sample variances and covariances
were estimates of population variances and covariances. However, here the x's
are considered to be constants that remain fixed from sample to sample, and a formula
such as $s_{11} = \sum_{i=1}^{n} (x_{i1} - \bar{x}_1)^2/(n - 1)$ summarizes the spread in the $n$ values
of $x_1$ but does not estimate a population variance.

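To express $\hat{\boldsymbol{\beta}}_1$ in terms of $\mathbf{S}_{xx}$ and $\mathbf{s}_{yx}$ in (10.16), we note first that the diagonal
elements of $\mathbf{X}_c'\mathbf{X}_c$ are corrected sums of squares. For example, in the second diagonal
position, we have

$$\sum_{i=1}^{n} (x_{i2} - \bar{x}_2)^2 = (n - 1)s_{22}.$$

The off-diagonal elements of $\mathbf{X}_c'\mathbf{X}_c$ are analogous corrected sums of products; for
example, the element in the (1,2) position is

$$\sum_{i=1}^{n} (x_{i1} - \bar{x}_1)(x_{i2} - \bar{x}_2) = (n - 1)s_{12}.$$

Thus

$$\frac{1}{n - 1}\mathbf{X}_c'\mathbf{X}_c = \mathbf{S}_{xx}. \tag{10.17}$$

Similarly,

$$\frac{1}{n - 1}\mathbf{X}_c'\mathbf{y} = \mathbf{s}_{yx}, \tag{10.18}$$

even though $\mathbf{y}$ has not been centered. The second element of $\mathbf{X}_c'\mathbf{y}$, for example, is
$\sum_i (x_{i2} - \bar{x}_2)y_i$, which is equal to $(n - 1)s_{2y}$:

$$\begin{aligned}
(n - 1)s_{2y} &= \sum_{i=1}^{n} (x_{i2} - \bar{x}_2)(y_i - \bar{y}) \\
&= \sum_i (x_{i2} - \bar{x}_2)y_i - \sum_i (x_{i2} - \bar{x}_2)\bar{y} \\
&= \sum_i (x_{i2} - \bar{x}_2)y_i,
\end{aligned}$$

since

$$\sum_i (x_{i2} - \bar{x}_2)\bar{y} = 0. \tag{10.19}$$

Now, multiplying and dividing by $n - 1$ in (10.14), we obtain

$$\hat{\boldsymbol{\beta}}_1 = (n - 1)(\mathbf{X}_c'\mathbf{X}_c)^{-1}\frac{\mathbf{X}_c'\mathbf{y}}{n - 1} = \left(\frac{\mathbf{X}_c'\mathbf{X}_c}{n - 1}\right)^{-1}\frac{\mathbf{X}_c'\mathbf{y}}{n - 1} = \mathbf{S}_{xx}^{-1}\mathbf{s}_{yx} \quad \text{[by (10.17) and (10.18)]}, \tag{10.20}$$

and substituting this in (10.15) gives

$$\hat\beta_0 = \hat\alpha - \hat{\boldsymbol{\beta}}_1'\bar{\mathbf{x}} = \bar{y} - \mathbf{s}_{yx}'\mathbf{S}_{xx}^{-1}\bar{\mathbf{x}}. \tag{10.21}$$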

10.2.5  Hypothesis  Tests 

In this section, we review two basic tests on the $\beta$'s. For other tests and confidence
intervals, see Rencher (1998, Section 7.2.4; 2000, Sections 8.4-8.7). In order to
obtain $F$-tests, we assume that $\mathbf{y}$ is $N_n(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I})$.

10.2.5a  Test  of  Overall  Regression 

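The overall regression hypothesis that none of the x's predict $y$ can be expressed
as $H_0\colon \boldsymbol{\beta}_1 = \mathbf{0}$, since $\boldsymbol{\beta}_1' = (\beta_1, \beta_2, \ldots, \beta_q)$. We do not include $\beta_0 = 0$ in the
hypothesis so as not to restrict $y$ to have an intercept of zero.

We can write SSE $= \mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y}$ in (10.9) in the form

$$\mathbf{y}'\mathbf{y} = (\mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y}) + \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y}, \tag{10.22}$$

which partitions $\mathbf{y}'\mathbf{y}$ into a part due to $\hat{\boldsymbol{\beta}}$ and a part due to deviations from the fitted
model.

To correct $y$ for its mean and thereby avoid inclusion of $\beta_0 = 0$, we subtract $n\bar{y}^2$
from both sides of (10.22) to obtain

$$\begin{aligned}
\mathbf{y}'\mathbf{y} - n\bar{y}^2 &= (\mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y}) + (\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} - n\bar{y}^2) \\
&= \mathrm{SSE} + \mathrm{SSR},
\end{aligned} \tag{10.23}$$

where $\mathbf{y}'\mathbf{y} - n\bar{y}^2 = \sum_i (y_i - \bar{y})^2$ is the total sum of squares adjusted for the mean
and SSR $= \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} - n\bar{y}^2$ is the overall regression sum of squares adjusted for the
intercept.

We can test $H_0\colon \boldsymbol{\beta}_1 = \mathbf{0}$ by means of

$$F = \frac{\mathrm{SSR}/q}{\mathrm{SSE}/(n - q - 1)}, \tag{10.24}$$

which is distributed as $F_{q,\,n-q-1}$ when $H_0\colon \boldsymbol{\beta}_1 = \mathbf{0}$ is true. We reject $H_0$ if
$F > F_{\alpha,\,q,\,n-q-1}$.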
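
The partitioning in (10.23) and the statistic in (10.24) can be computed as in the sketch below. The function name and the four-point data set are illustrative only; for these data the total corrected sum of squares is 5, SSE $= 1.8$, SSR $= 3.2$, and $F = 3.2/(1.8/2) \approx 3.56$ with 1 and 2 degrees of freedom.

```python
import numpy as np

def overall_f(X, y):
    """Return (F, SSR, SSE) for H0: beta_1 = 0, following (10.23) and (10.24)."""
    n, k = X.shape
    q = k - 1                                   # number of x's, excluding the intercept
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    fit = float(beta_hat @ X.T @ y)             # beta_hat' X'y
    sse = float(y @ y) - fit
    ssr = fit - n * float(y.mean()) ** 2
    F = (ssr / q) / (sse / (n - q - 1))
    return F, ssr, sse

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 2.0, 4.0])
X = np.column_stack([np.ones(4), x])
F, ssr, sse = overall_f(X, y)
```

The computed $F$ would then be compared with the critical value $F_{\alpha,\,q,\,n-q-1}$ from a table or from statistical software.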


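10.2.5b  Test on a Subset of the $\beta$'s

In an attempt to simplify the model, we may wish to test the hypothesis that some of
the $\beta$'s are zero. For example, in the model

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_2^2 + \beta_5 x_1 x_2 + \varepsilon,$$

we may be interested in the hypothesis $H_0\colon \beta_3 = \beta_4 = \beta_5 = 0$. If $H_0$ is true, the
model is linear in $x_1$ and $x_2$. In other cases, we may want to ascertain whether a
single $\beta_j$ can be deleted.

For convenience of exposition, let the $\beta$'s that are candidates for deletion be
rearranged to appear last in $\boldsymbol{\beta}$ and denote this subset of $\beta$'s by $\boldsymbol{\beta}_d$, where $d$ reminds
us that these $\beta$'s are to be deleted if $H_0\colon \boldsymbol{\beta}_d = \mathbf{0}$ is accepted. Let the subset to be
retained in the reduced model be denoted by $\boldsymbol{\beta}_r$. Thus $\boldsymbol{\beta}$ is partitioned into

$$\boldsymbol{\beta} = \begin{pmatrix} \boldsymbol{\beta}_r \\ \boldsymbol{\beta}_d \end{pmatrix}.$$

Let $h$ designate the number of parameters in $\boldsymbol{\beta}_d$. Then there are $q + 1 - h$ parameters
in $\boldsymbol{\beta}_r$.

To test the hypothesis $H_0\colon \boldsymbol{\beta}_d = \mathbf{0}$, we fit the full model containing all the $\beta$'s
in $\boldsymbol{\beta}$ and then fit the reduced model containing only the $\beta$'s in $\boldsymbol{\beta}_r$. Let $\mathbf{X}_r$ be the
columns of $\mathbf{X}$ corresponding to $\boldsymbol{\beta}_r$. Then the reduced model can be written as

$$\mathbf{y} = \mathbf{X}_r\boldsymbol{\beta}_r + \boldsymbol{\varepsilon}, \tag{10.25}$$

and $\boldsymbol{\beta}_r$ is estimated by $\hat{\boldsymbol{\beta}}_r = (\mathbf{X}_r'\mathbf{X}_r)^{-1}\mathbf{X}_r'\mathbf{y}$. To compare the fit of the full model and
the reduced model, we calculate

$$\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} - \hat{\boldsymbol{\beta}}_r'\mathbf{X}_r'\mathbf{y}, \tag{10.26}$$

where $\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y}$ is the regression sum of squares from the full model and $\hat{\boldsymbol{\beta}}_r'\mathbf{X}_r'\mathbf{y}$ is the
regression sum of squares for the reduced model. The difference in (10.26) shows
what $\boldsymbol{\beta}_d$ contributes "above and beyond" $\boldsymbol{\beta}_r$. We can test $H_0\colon \boldsymbol{\beta}_d = \mathbf{0}$ with an
$F$-statistic:

$$F = \frac{(\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} - \hat{\boldsymbol{\beta}}_r'\mathbf{X}_r'\mathbf{y})/h}{(\mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y})/(n - q - 1)} \tag{10.27}$$

$$= \frac{(\mathrm{SSR}_f - \mathrm{SSR}_r)/h}{\mathrm{SSE}_f/(n - q - 1)} = \frac{\mathrm{MSR}}{\mathrm{MSE}}, \tag{10.28}$$

where SSR$_f = \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y}$ and SSR$_r = \hat{\boldsymbol{\beta}}_r'\mathbf{X}_r'\mathbf{y}$. The $F$-statistic in (10.27) and (10.28) is
distributed as $F_{h,\,n-q-1}$ if $H_0$ is true. We reject $H_0$ if $F > F_{\alpha,\,h,\,n-q-1}$.

The test in (10.27) is easy to carry out in practice. We fit the full model and obtain
the regression and error sums of squares $\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y}$ and $\mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y}$, respectively. We
then fit the reduced model and obtain its regression sum of squares $\hat{\boldsymbol{\beta}}_r'\mathbf{X}_r'\mathbf{y}$ to be
subtracted from $\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y}$. If a software package gives the regression sum of squares in
corrected form, this can readily be used to obtain $\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} - \hat{\boldsymbol{\beta}}_r'\mathbf{X}_r'\mathbf{y}$, since

$$\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} - n\bar{y}^2 - (\hat{\boldsymbol{\beta}}_r'\mathbf{X}_r'\mathbf{y} - n\bar{y}^2) = \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} - \hat{\boldsymbol{\beta}}_r'\mathbf{X}_r'\mathbf{y}.$$

Alternatively, we can obtain $\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} - \hat{\boldsymbol{\beta}}_r'\mathbf{X}_r'\mathbf{y}$ as the difference between error sums of
squares for the two models:

$$\begin{aligned}
\mathrm{SSE}_r - \mathrm{SSE}_f &= \mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}_r'\mathbf{X}_r'\mathbf{y} - (\mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y}) \\
&= \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} - \hat{\boldsymbol{\beta}}_r'\mathbf{X}_r'\mathbf{y}.
\end{aligned}$$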

A test for an individual $\beta_j$ above and beyond the other $\beta$'s is readily obtained
using (10.27). To test $H_0\colon \beta_j = 0$, we arrange $\beta_j$ last in $\boldsymbol{\beta}$,

$$\boldsymbol{\beta} = \begin{pmatrix} \boldsymbol{\beta}_r \\ \beta_j \end{pmatrix},$$

where $\boldsymbol{\beta}_r = (\beta_0, \beta_1, \ldots, \beta_{q-1})'$ contains all the $\beta$'s except $\beta_j$. By (10.27), the test
statistic is

$$F = \frac{\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} - \hat{\boldsymbol{\beta}}_r'\mathbf{X}_r'\mathbf{y}}{\mathrm{SSE}_f/(n - q - 1)}, \tag{10.29}$$

which is $F_{1,\,n-q-1}$. Note that $h = 1$. The test of $H_0\colon \beta_j = 0$ by the $F$-statistic in
(10.29) is called a partial $F$-test. A detailed breakdown of the effect of each variable
in the presence of the others is given by Rencher (1993; 2000, Section 10.5).

Since the $F$-statistic in (10.29) has 1 and $n - q - 1$ degrees of freedom, it is the
square of a $t$-statistic. The $t$-statistic equivalent to (10.29) is

$$t_j = \frac{\hat\beta_j}{s\sqrt{g_{jj}}},$$

where $g_{jj}$ is the $j$th diagonal element of $(\mathbf{X}'\mathbf{X})^{-1}$ and $s = \sqrt{\mathrm{SSE}_f/(n - q - 1)}$
(Rencher 2000, Section 8.5.1).
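
The equivalence between the partial $F$-test and the $t$-statistic can be checked numerically. In the sketch below, the function name is hypothetical and `j` is a 0-based column index into $\mathbf{X}$ (so `j = 0` refers to the intercept column); the reduced model is obtained simply by deleting column `j`, which avoids rearranging the $\beta$'s.

```python
import numpy as np

def partial_f_and_t(X, y, j):
    """Return the partial F of (10.29) and the equivalent t-statistic for column j."""
    n, k = X.shape                                  # k = q + 1
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    ssr_f = float(beta_hat @ X.T @ y)               # full-model beta_hat' X'y
    s2 = (float(y @ y) - ssr_f) / (n - k)           # SSE_f / (n - q - 1)
    Xr = np.delete(X, j, axis=1)                    # reduced model without column j
    beta_r = np.linalg.solve(Xr.T @ Xr, Xr.T @ y)
    ssr_r = float(beta_r @ Xr.T @ y)
    F = (ssr_f - ssr_r) / s2                        # h = 1
    t = float(beta_hat[j]) / np.sqrt(s2 * XtX_inv[j, j])
    return F, t

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 2.0, 4.0])
X = np.column_stack([np.ones(4), x])
F, t = partial_f_and_t(X, y, j=1)
```

For this small data set, `F` equals `t ** 2` up to rounding error, as the text indicates.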


10.2.6  $R^2$ in Fixed-x Regression

The proportion of the (corrected) total variation in the y's that can be attributed to
regression on the x's is denoted by $R^2$:

$$R^2 = \frac{\text{regression sum of squares}}{\text{total sum of squares}} = \frac{\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} - n\bar{y}^2}{\mathbf{y}'\mathbf{y} - n\bar{y}^2}. \tag{10.30}$$

The ratio $R^2$ is called the coefficient of multiple determination, or more commonly
the squared multiple correlation. The multiple correlation $R$ is defined as the positive
square root of $R^2$.

The $F$-test for overall regression in (10.24) can be expressed in terms of $R^2$ as

$$F = \frac{n - q - 1}{q}\cdot\frac{R^2}{1 - R^2}. \tag{10.31}$$

For the reduced model (10.25), $R^2$ can be written as

$$R_r^2 = \frac{\hat{\boldsymbol{\beta}}_r'\mathbf{X}_r'\mathbf{y} - n\bar{y}^2}{\mathbf{y}'\mathbf{y} - n\bar{y}^2}. \tag{10.32}$$

Then in terms of $R^2$ and $R_r^2$, the full and reduced model test in (10.27) for
$H_0\colon \boldsymbol{\beta}_d = \mathbf{0}$ becomes

$$F = \frac{(R^2 - R_r^2)/h}{(1 - R^2)/(n - q - 1)} \tag{10.33}$$

[see (11.36)].
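
The relation between $R^2$ in (10.30) and the overall $F$ in (10.31) is easy to verify numerically. The sketch below uses a hypothetical function name and a four-point data set chosen for illustration, for which SSR $= 3.2$ and the corrected total sum of squares is 5, so that $R^2 = .64$.

```python
import numpy as np

def r_squared(X, y):
    """R^2 as in (10.30): (beta_hat'X'y - n*ybar^2) / (y'y - n*ybar^2)."""
    n = len(y)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    correction = n * float(y.mean()) ** 2
    return (float(beta_hat @ X.T @ y) - correction) / (float(y @ y) - correction)

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 2.0, 4.0])
X = np.column_stack([np.ones(4), x])

n, q = X.shape[0], X.shape[1] - 1
R2 = r_squared(X, y)
F_from_R2 = (n - q - 1) / q * R2 / (1.0 - R2)   # (10.31)
```

Here `F_from_R2` reproduces the value of (10.24) for the same data, $3.2/(1.8/2)$.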

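We can express $R^2$ in terms of sample variances, covariances, and correlations:

\[
R^2 = \frac{\mathbf{s}_{yx}'\mathbf{S}_{xx}^{-1}\mathbf{s}_{yx}}{s_{yy}}
    = \mathbf{r}_{yx}'\mathbf{R}_{xx}^{-1}\mathbf{r}_{yx},
\tag{10.34}
\]

where $s_{yy}$, $\mathbf{s}_{yx}$, and $\mathbf{S}_{xx}$ are defined in (10.16) and $\mathbf{r}_{yx}$ and $\mathbf{R}_{xx}$ are from an analogous
partitioning of the sample correlation matrix of y and the x's:

\[
\mathbf{R} =
\begin{pmatrix}
1 & r_{y1} & r_{y2} & \cdots & r_{yq} \\
r_{1y} & 1 & r_{12} & \cdots & r_{1q} \\
\vdots & \vdots & \vdots & & \vdots \\
r_{qy} & r_{q1} & r_{q2} & \cdots & 1
\end{pmatrix}
=
\begin{pmatrix}
1 & \mathbf{r}_{yx}' \\
\mathbf{r}_{yx} & \mathbf{R}_{xx}
\end{pmatrix}.
\tag{10.35}
\]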
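The relationships among $R^2$, the overall F-test, and the partial F-test can be checked numerically. The following is a minimal sketch on simulated data (the data, seed, and helper name `r_squared` are illustrative assumptions, not part of the text); it computes $R^2$ as in (10.30), confirms that the sum-of-squares form of the overall F agrees with (10.31), and confirms that the partial F from (10.33) with $h = 1$ equals the square of the corresponding t-statistic.

```python
import numpy as np


def r_squared(X, y):
    """R^2 as in (10.30); X must contain a leading column of ones."""
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    correction = len(y) * y.mean() ** 2          # n * ybar^2
    return (beta_hat @ X.T @ y - correction) / (y @ y - correction)


# Simulated data (assumption): n observations on q fixed x's.
rng = np.random.default_rng(0)
n, q = 30, 3
x = rng.normal(size=(n, q))
y = 1.0 + x @ np.array([2.0, 0.0, -1.0]) + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
sse = y @ y - beta_hat @ X.T @ y
ssr = beta_hat @ X.T @ y - n * y.mean() ** 2

# Overall F: sum-of-squares form versus the R^2 form in (10.31).
r2 = r_squared(X, y)
f_from_ss = (ssr / q) / (sse / (n - q - 1))
f_from_r2 = (n - q - 1) / q * r2 / (1 - r2)

# Partial F for the last x via (10.33) with h = 1, and the equivalent t.
h = 1
r2_reduced = r_squared(X[:, :-1], y)
f_partial = ((r2 - r2_reduced) / h) / ((1 - r2) / (n - q - 1))
s = np.sqrt(sse / (n - q - 1))
g_qq = np.linalg.inv(X.T @ X)[q, q]
t_last = beta_hat[q] / (s * np.sqrt(g_qq))

print(f"R^2 = {r2:.4f}, F = {f_from_r2:.3f}")
print(f"partial F = {f_partial:.4f}, t^2 = {t_last ** 2:.4f}")
```

The two printed values on the last line should agree up to rounding, which is the numerical counterpart of the statement that an F with 1 and $n - q - 1$ degrees of freedom is the square of a t.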


10.2.7  Subset  Selection 

In practice, one often has more x's than are needed for predicting y. Some of them
may be redundant and could be discarded. In addition to logistical motivations for
deleting variables, there are statistical incentives; for example, if an x is deleted from
the fitted model, the variances of the $\hat{\beta}_j$'s and of the $\hat{y}_i$'s are reduced. Various aspects
of model validation are reviewed by Rencher (2000, Section 7.9 and Chapter 9).

The two most popular approaches to subset selection are to (1) examine all possible
subsets and (2) use a stepwise technique. We discuss these in the next two
sections.


10.2.7a  All  Possible  Subsets 

The  optimal  approach  to  subset  selection  is  to  examine  all  possible  subsets  of  the  x’s. 
This  may  not  be  computationally  feasible  if  the  sample  size  and  number  of  variables 
are  large.  Some  programs  take  advantage  of  algorithms  that  find  the  optimum  subset 
of  each  size  without  examining  all  of  the  subsets  [see,  for  example,  Furnival  and 
Wilson  (1974)]. 

We discuss three criteria for comparing subsets when searching for the best subset.
To conform with established notation in the literature, the number of variables in a
subset is denoted by $p - 1$, so that with the inclusion of an intercept, there are $p$
parameters in the model. The corresponding total number of available variables from
which a subset is to be selected is denoted by $k - 1$, with $k$ parameters in the model.

1. $R_p^2$. By its definition in (10.30) as the proportion of total (corrected) sum of
squares accounted for by regression, $R_p^2$ is clearly a measure of model fit. The subscript
$p$ is an index of the subset size, since it indicates the number of parameters
in the model, including an intercept. However, $R_p^2$ does not reach a maximum for
any value of $p$ less than $k$ because it cannot decrease when a variable is added to the
model. The usual procedure is to find the subset with largest $R_p^2$ for each of $p = 2,
3, \ldots, k$ and then choose a value of $p$ beyond which the increases in $R_p^2$ appear to
be unimportant. This judgment is, of course, subjective.

2. $s_p^2$. Another useful criterion is the variance estimator for each subset as defined
in (10.8):

\[
s_p^2 = \frac{\mathrm{SSE}_p}{n - p}.
\tag{10.36}
\]

For each of $p = 2, 3, \ldots, k$, we find the subset with smallest $s_p^2$. If $k$ is fairly large,
a typical pattern as $p$ approaches $k$ is for the minimal $s_p^2$ to decrease to an overall
minimum less than $s_k^2$ and then increase. The minimum value of $s_p^2$ can be less than
$s_k^2$ if the decrease in $\mathrm{SSE}_p$ with an additional variable does not offset the loss of
a degree of freedom in the denominator. It is often suggested that the researcher
choose the subset with absolute minimum $s_p^2$. However, as Hocking (1976, p. 19)
notes, this procedure may fit some noise unique to the sample and thereby include
one or more extraneous predictor variables. An alternative suggestion is to choose $p$
such that $\min_p s_p^2 = s_k^2$ or, more precisely, choose the smallest value of $p$ such that
$\min_p s_p^2 < s_k^2$, since there will not be a $p < k$ such that $\min_p s_p^2$ is exactly equal
to $s_k^2$.

3. $C_p$. The $C_p$ criterion is due to Mallows (1964, 1973). In the following development,
we follow Myers (1990, pp. 180-182). The expected squared error,
$E[\hat{y}_i - E(y_i)]^2$, is used in formulating the $C_p$ criterion because it incorporates a variance
component and a bias component. The goal is to find a model that achieves a good
balance between the bias and variance of the fitted values $\hat{y}_i$. Bias arises when the $\hat{y}_i$
values are based on an incorrect model, in which $E(\hat{y}_i) \neq E(y_i)$. If $\hat{y}_i$ were based
on the correct model, so that $E(\hat{y}_i) = E(y_i)$, then $E[\hat{y}_i - E(y_i)]^2$ would be equal
to $\mathrm{var}(\hat{y}_i)$. In general, however, as we examine many competing models, for various
values of $p$, $\hat{y}_i$ is not based on the correct model, and we have (see Problem 10.4)

\begin{align}
E[\hat{y}_i - E(y_i)]^2 &= E[\hat{y}_i - E(\hat{y}_i) + E(\hat{y}_i) - E(y_i)]^2 \notag \\
  &= E[\hat{y}_i - E(\hat{y}_i)]^2 + [E(\hat{y}_i) - E(y_i)]^2 \tag{10.37} \\
  &= \mathrm{var}(\hat{y}_i) + (\text{bias in } \hat{y}_i)^2. \tag{10.38}
\end{align}

For a given value of $p$, the total expected squared error for the $n$ observations in the
sample, standardized by dividing by $\sigma^2$, becomes

\[
\frac{1}{\sigma^2}\sum_{i=1}^{n} E[\hat{y}_i - E(y_i)]^2
  = \frac{1}{\sigma^2}\sum_{i=1}^{n} \mathrm{var}(\hat{y}_i)
  + \frac{1}{\sigma^2}\sum_{i=1}^{n} (\text{bias in } \hat{y}_i)^2.
\tag{10.39}
\]

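Before defining $C_p$ as an estimate of (10.39), we can achieve some simplification.
We first show that $\sum_i \mathrm{var}(\hat{y}_i)/\sigma^2$ is equal to $p$. Let the model for all $n$ observations
be designated by

\[
\mathbf{y} = \mathbf{X}_p\boldsymbol{\beta}_p + \boldsymbol{\varepsilon}.
\]

We assume that, in general, this prospective model is underspecified and that the true
model (which produces $\sigma^2$) contains additional $\beta$'s and additional columns of the $\mathbf{X}$
matrix. If we designate the $i$th row of $\mathbf{X}_p$ by $\mathbf{x}_{pi}'$, then the first term on the right side
of (10.39) becomes (see also Problem 10.5)

\begin{align}
\frac{1}{\sigma^2}\sum_{i=1}^{n} \mathrm{var}(\hat{y}_i)
  &= \frac{1}{\sigma^2}\sum_{i=1}^{n} \mathrm{var}(\mathbf{x}_{pi}'\hat{\boldsymbol{\beta}}_p) \notag \\
  &= \frac{1}{\sigma^2}\sum_{i=1}^{n} \mathbf{x}_{pi}'[\sigma^2(\mathbf{X}_p'\mathbf{X}_p)^{-1}]\mathbf{x}_{pi}
     && [\text{by (3.70)}] \notag \\
  &= \mathrm{tr}[\mathbf{X}_p(\mathbf{X}_p'\mathbf{X}_p)^{-1}\mathbf{X}_p']
     && [\text{by (3.65)}] \tag{10.40} \\
  &= \mathrm{tr}[(\mathbf{X}_p'\mathbf{X}_p)^{-1}\mathbf{X}_p'\mathbf{X}_p]
     && [\text{by (2.97)}] \notag \\
  &= \mathrm{tr}(\mathbf{I}_p) = p. \tag{10.41}
\end{align}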

It can be shown (Myers 1990, pp. 178-179) that

\[
\sum_{i=1}^{n} (\text{bias in } \hat{y}_i)^2 = (n - p)E(s_p^2 - \sigma^2).
\tag{10.42}
\]

Using (10.41) and (10.42), the final simplified form of the (standardized) total
expected squared error in (10.39) is

\[
\frac{1}{\sigma^2}\sum_{i=1}^{n} E[\hat{y}_i - E(y_i)]^2 = p + \frac{n - p}{\sigma^2}E(s_p^2 - \sigma^2).
\tag{10.43}
\]

In practice, $\sigma^2$ is usually estimated by $s_k^2$, the MSE from the full model. We thus
estimate (10.43) by

\[
C_p = p + (n - p)\frac{s_p^2 - s_k^2}{s_k^2}.
\tag{10.44}
\]

An alternative form is

\[
C_p = \frac{\mathrm{SSE}_p}{s_k^2} - (n - 2p).
\tag{10.45}
\]


In (10.44), we see that if the bias is small for a particular model, $C_p$ will be close
to $p$. For this reason, the line $C_p = p$ is commonly plotted along with the $C_p$ values
of several candidate models. We look for small values of $C_p$ that are near this line.
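The equivalence of the two forms of $C_p$ takes only one step, which may be worth writing out. Using $(n - p)s_p^2 = \mathrm{SSE}_p$ from (10.36),

\[
C_p = p + (n - p)\frac{s_p^2 - s_k^2}{s_k^2}
    = p + \frac{(n - p)s_p^2}{s_k^2} - (n - p)
    = \frac{\mathrm{SSE}_p}{s_k^2} - (n - 2p).
\]

As a check, for the full model ($p = k$) we have $s_p^2 = s_k^2$, so that $C_k = k$ exactly; the full model always lies on the line $C_p = p$.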


In a Monte Carlo study, Hilton (1983) compared several subset selection criteria
based on $\mathrm{MSE}_p$ and $C_p$. The three best procedures were to choose (1) the subset with
the smallest $p$ such that $C_p < p$, (2) the subset with the smallest $p$ such that $s_p^2 < s_k^2$,
and (3) the subset with minimum $s_p^2$. The first of these was found to give best results
overall, with the second method close behind. The third method performed best in
some cases where $k$ was small.
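The three criteria can be computed together in a brute-force search. The sketch below is illustrative only: it enumerates every subset directly rather than using an efficient algorithm such as that of Furnival and Wilson, and the simulated data, seed, and function names (`sse`, `all_subsets`) are assumptions introduced for the example.

```python
from itertools import combinations

import numpy as np


def sse(X, y):
    """Residual sum of squares from least squares of y on the columns of X."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    return float(resid @ resid)


def all_subsets(x, y):
    """Return (p, columns, R2_p, s2_p, C_p) for every nonempty subset of x's."""
    n, n_vars = x.shape
    k = n_vars + 1                        # parameters in the full model
    ones = np.ones((n, 1))
    sst = float(np.sum((y - y.mean()) ** 2))
    s2_k = sse(np.hstack([ones, x]), y) / (n - k)
    results = []
    for size in range(1, n_vars + 1):
        p = size + 1                      # subset size plus intercept
        for cols in combinations(range(n_vars), size):
            sse_p = sse(np.hstack([ones, x[:, list(cols)]]), y)
            r2_p = 1.0 - sse_p / sst      # equals (10.30) with an intercept
            s2_p = sse_p / (n - p)        # (10.36)
            c_p = sse_p / s2_k - (n - 2 * p)   # (10.45)
            results.append((p, cols, r2_p, s2_p, c_p))
    return results


# Simulated data (assumption): only columns 0 and 2 are true predictors.
rng = np.random.default_rng(0)
n = 40
x = rng.normal(size=(n, 4))
y = 2.0 + 1.5 * x[:, 0] - 1.0 * x[:, 2] + rng.normal(size=n)

results = all_subsets(x, y)
for p in range(2, 6):
    best = min((r for r in results if r[0] == p), key=lambda r: r[4])
    print(p, best[1], round(best[2], 3), round(best[3], 3), round(best[4], 2))
```

For a fixed $p$, all three criteria are monotone functions of $\mathrm{SSE}_p$, so they pick the same best subset of each size; they differ only in how they compare subsets of different sizes.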


10.2.7b  Stepwise  Selection 

For many data sets, it may be impractical to examine all possible subsets, even with
an efficient algorithm such as that of Furnival and Wilson (1974). In such cases, we
can use the familiar stepwise approach, which is widely available and has virtually
no limit as to the number of variables or observations. A related stepwise technique
was discussed in Sections 6.11.2 and 8.9 in connection with selection of dependent
variables to separate groups in a MANOVA or discriminant analysis setting. In this
section, we are concerned with selecting the independent variables (x's) that best
predict the dependent variable (y) in regression.

We first review the forward selection procedure, which typically uses an F-test at
each step. At the first step, y is regressed on each $x_j$ alone, and the x with the largest
F-value is "entered" into the model. At the second step, we search for the variable
with the largest partial F-value for testing the significance of each variable in the
presence of the variable first entered. Thus, if we denote the first variable to enter as
$x_1$, then at the second step we calculate the partial F-statistic

\[
F = \frac{\mathrm{MSR}(x_j \mid x_1)}{\mathrm{MSE}(x_j, x_1)}
\]

for each $j \neq 1$ and choose the variable that maximizes $F$, where $\mathrm{MSR} = (\mathrm{SSR}_f -
\mathrm{SSR}_r)/h$ and $\mathrm{MSE} = \mathrm{SSE}_f/(n - q - 1)$ are the mean squares for regression and error,
respectively, as in (10.28). In this case, $\mathrm{SSR}_f = \mathrm{SSR}(x_1, x_j)$ and $\mathrm{SSR}_r = \mathrm{SSR}(x_1)$.
Note also that $h = 1$ because only one variable is being added, and MSE is calculated
using only the variable already entered plus the candidate variable. This procedure
continues at each step until the largest partial F for an entering variable falls below
a preselected threshold F-value or until the corresponding p-value exceeds some
predetermined level.
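The forward selection steps just described can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference implementation: the entry threshold `f_to_enter = 4.0`, the function names, and the simulated data are all hypothetical choices made for the example. The partial F uses $\mathrm{SSR}_f - \mathrm{SSR}_r = \mathrm{SSE}_r - \mathrm{SSE}_f$, with the error mean square taken from the model containing the variables already entered plus the candidate.

```python
import numpy as np


def sse(X, y):
    """Residual sum of squares from least squares of y on the columns of X."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    return float(resid @ resid)


def forward_select(x, y, f_to_enter=4.0):
    """Forward selection using a partial F-to-enter threshold (hypothetical)."""
    n, q = x.shape
    ones = np.ones((n, 1))
    selected, remaining = [], list(range(q))
    sse_reduced = sse(ones, y)            # intercept-only model
    while remaining:
        # Error df for a model with the entered variables plus one candidate.
        df_error = n - (len(selected) + 1) - 1
        if df_error <= 0:
            break
        candidates = []
        for j in remaining:
            sse_full = sse(np.hstack([ones, x[:, selected + [j]]]), y)
            # h = 1, so MSR is just the drop in SSE.
            f_partial = (sse_reduced - sse_full) / (sse_full / df_error)
            candidates.append((f_partial, j, sse_full))
        f_best, j_best, sse_best = max(candidates)
        if f_best < f_to_enter:           # no candidate is worth entering
            break
        selected.append(j_best)
        remaining.remove(j_best)
        sse_reduced = sse_best
    return selected


# Simulated data (assumption): columns 0 and 2 are true predictors.
rng = np.random.default_rng(2)
n = 100
x = rng.normal(size=(n, 5))
y = 3.0 * x[:, 0] - 2.0 * x[:, 2] + rng.normal(size=n)

selected = forward_select(x, y)
print("order of entry:", selected)
```

Because the largest of several partial F-values is compared with a fixed threshold, noise columns can occasionally enter after the true predictors; the sketch makes no attempt to adjust for that.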

The  stepwise  selection  procedure  similarly  seeks  the  best  variable  to  enter  at  each 
step.  Then  after  a  variable  has  entered,  each  of  the  variables  previously  entered  is 
examined  by  a  partial  F-test  to  see  if  it  is  no  longer  significant  and  can  be  dropped 
from  the  model. 

The  backward  elimination  procedure  begins  with  all  x’s  in  the  model  and  deletes 
one  at  a  time.  The  partial  F-statistic  for  each  variable  in  the  presence  of  the  others 
is  calculated,  and  the  variable  with  smallest  F  is  eliminated.  This  continues  until  the 
smallest  F  at  some  step  exceeds  a  preselected  threshold  value. 

Since these sequential methods do not examine all subsets, they will often fail
to find the optimum subset, especially if $k$ is large. However, $R_p^2$, $s_p^2$, or $C_p$ may
not differ substantially between the optimum subset and the one found by stepwise
selection. These sequential methods have been popular for at least a generation, and it
is very likely they will continue to be used, even though increased computing power
has put the optimal methods within reach for larger data sets.

There are some possible risks in the use of stepwise methods. The stepwise procedure
may fail to detect a true predictor (an $x_j$ for which $\beta_j \neq 0$) because $s_p^2$ is biased
upward in an underspecified model, thus artificially reducing the partial F-value. On
the other hand, a variable that is not a true predictor of y (an $x_j$ for which $\beta_j = 0$)
may enter because of chance correlations in a particular sample. In the presence
of such "noise" variables, the partial F-statistic for the entering variable does not
have an F-distribution because it is maximized at each step. The calculated p-values
become optimistic. This problem intensifies when the sample size is relatively small
compared to the number of variables. Rencher and Pun (1980) found that in such
cases some surprisingly large values of $R^2$ can occur, even when there is no relationship
between y and the x's in the population. In a related study, Flack and Chang
(1987) included x's that were authentic contributors as well as noise variables. They
found that "for most samples, a large percentage of the selected variables is noise,
particularly when the number of candidate variables is large relative to the number
of observations. The adjusted $R^2$ of the selected variables is highly inflated" (p. 84).


10.3  MULTIPLE  REGRESSION:  RANDOM  x’s 

In  Section  10.2,  it  was  assumed  that  the  x’s  were  fixed  and  would  have  the  same 
values  if  another  sample  were  taken;  that  is,  the  same  X  matrix  would  be  used  each 
time  a  y  vector  was  observed.  However,  many  regression  applications  involve  x’s 
that  are  random  variables. 

Thus in the random-x case, the values of $x_1, x_2, \ldots, x_q$ are not under the control
of the experimenter. They occur randomly along with y. On each subject we observe

\[
y, x_1, x_2, \ldots, x_q.
\]

If we assume that $(y, x_1, x_2, \ldots, x_q)$ has a multivariate normal distribution, then
$\hat{\boldsymbol{\beta}}$, $R^2$, and the F-tests have the same formulation as in the fixed-x case [for details,
see Rencher (1998, Section 7.3; 2000, Section 10.4)]. Thus with the multivariate
normal assumption, we can proceed with estimation and testing the same way in the
random-x case as with fixed x's.


10.4  MULTIVARIATE  MULTIPLE  REGRESSION:  ESTIMATION 

In this section we extend the estimation results of Sections 10.2.2-10.2.4 to the multivariate
y case. We assume the x's are fixed.

10.4.1  The  Multivariate  Linear  Model 

We turn now to the multivariate multiple regression model, where multivariate refers
to the dependent variables and multiple pertains to the independent variables. In
this case, several y's are measured corresponding to each set of x's. Each of $y_1,
y_2, \ldots, y_p$ is to be predicted by all of $x_1, x_2, \ldots, x_q$.

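The $n$ observed values of the vector of y's can be listed as rows in the following
matrix:

\[
\mathbf{Y} =
\begin{pmatrix}
y_{11} & y_{12} & \cdots & y_{1p} \\
y_{21} & y_{22} & \cdots & y_{2p} \\
\vdots & \vdots & & \vdots \\
y_{n1} & y_{n2} & \cdots & y_{np}
\end{pmatrix}
=
\begin{pmatrix}
\mathbf{y}_1' \\ \mathbf{y}_2' \\ \vdots \\ \mathbf{y}_n'
\end{pmatrix}.
\]

Thus each row of $\mathbf{Y}$ contains the values of the $p$ dependent variables measured on a
subject. Each column of $\mathbf{Y}$ consists of the $n$ observations on one of the $p$ variables
and therefore corresponds to the $\mathbf{y}$ vector in the (univariate) regression model (10.3).

The $n$ values of $x_1, x_2, \ldots, x_q$ can be placed in a matrix that turns out to be the
same as the $\mathbf{X}$ matrix in the multiple regression formulation in Section 10.2.1:

\[
\mathbf{X} =
\begin{pmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1q} \\
1 & x_{21} & x_{22} & \cdots & x_{2q} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{nq}
\end{pmatrix}.
\]

We assume that $\mathbf{X}$ is fixed from sample to sample.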

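Since each of the $p$ y's will depend on the x's in its own way, each column of
$\mathbf{Y}$ will need different $\beta$'s. Thus we have a column of $\beta$'s for each column of $\mathbf{Y}$,
and these columns form a matrix $\mathbf{B} = (\boldsymbol{\beta}_1, \boldsymbol{\beta}_2, \ldots, \boldsymbol{\beta}_p)$. Our multivariate model is
therefore

\[
\mathbf{Y} = \mathbf{X}\mathbf{B} + \boldsymbol{\Xi},
\]

where $\mathbf{Y}$ is $n \times p$, $\mathbf{X}$ is $n \times (q + 1)$, and $\mathbf{B}$ is $(q + 1) \times p$. The notation $\boldsymbol{\Xi}$ (the
uppercase version of $\xi$) is adopted here because of its resemblance to $\varepsilon$.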

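We illustrate the multivariate model with $p = 2$, $q = 3$:

\[
\begin{pmatrix}
y_{11} & y_{12} \\
y_{21} & y_{22} \\
\vdots & \vdots \\
y_{n1} & y_{n2}
\end{pmatrix}
=
\begin{pmatrix}
1 & x_{11} & x_{12} & x_{13} \\
1 & x_{21} & x_{22} & x_{23} \\
\vdots & \vdots & \vdots & \vdots \\
1 & x_{n1} & x_{n2} & x_{n3}
\end{pmatrix}
\begin{pmatrix}
\beta_{01} & \beta_{02} \\
\beta_{11} & \beta_{12} \\
\beta_{21} & \beta_{22} \\
\beta_{31} & \beta_{32}
\end{pmatrix}
+
\begin{pmatrix}
\varepsilon_{11} & \varepsilon_{12} \\
\varepsilon_{21} & \varepsilon_{22} \\
\vdots & \vdots \\
\varepsilon_{n1} & \varepsilon_{n2}
\end{pmatrix}.
\]

The model for the first column of $\mathbf{Y}$ is

\[
\begin{pmatrix}
y_{11} \\ y_{21} \\ \vdots \\ y_{n1}
\end{pmatrix}
=
\begin{pmatrix}
1 & x_{11} & x_{12} & x_{13} \\
1 & x_{21} & x_{22} & x_{23} \\
\vdots & \vdots & \vdots & \vdots \\
1 & x_{n1} & x_{n2} & x_{n3}
\end{pmatrix}
\begin{pmatrix}
\beta_{01} \\ \beta_{11} \\ \beta_{21} \\ \beta_{31}
\end{pmatrix}
+
\begin{pmatrix}
\varepsilon_{11} \\ \varepsilon_{21} \\ \vdots \\ \varepsilon_{n1}
\end{pmatrix},
\]

and for the second column, we have

\[
\begin{pmatrix}
y_{12} \\ y_{22} \\ \vdots \\ y_{n2}
\end{pmatrix}
=
\begin{pmatrix}
1 & x_{11} & x_{12} & x_{13} \\
1 & x_{21} & x_{22} & x_{23} \\
\vdots & \vdots & \vdots & \vdots \\
1 & x_{n1} & x_{n2} & x_{n3}
\end{pmatrix}
\begin{pmatrix}
\beta_{02} \\ \beta_{12} \\ \beta_{22} \\ \beta_{32}
\end{pmatrix}
+
\begin{pmatrix}
\varepsilon_{12} \\ \varepsilon_{22} \\ \vdots \\ \varepsilon_{n2}
\end{pmatrix}.
\]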

By analogy with the univariate case in Section 10.2.1, additional assumptions that
lead to good estimates are as follows:

1. $E(\mathbf{Y}) = \mathbf{X}\mathbf{B}$ or $E(\boldsymbol{\Xi}) = \mathbf{O}$.

2. $\mathrm{cov}(\mathbf{y}_i) = \boldsymbol{\Sigma}$ for all $i = 1, 2, \ldots, n$, where $\mathbf{y}_i'$ is the $i$th row of $\mathbf{Y}$.

3. $\mathrm{cov}(\mathbf{y}_i, \mathbf{y}_j) = \mathbf{O}$ for all $i \neq j$.

Assumption 1 states that the linear model is correct and that no additional x's are
needed to predict the y's. Assumption 2 asserts that each of the $n$ observation vectors
(rows) in $\mathbf{Y}$ has the same covariance matrix. Assumption 3 declares that observation
vectors (rows of $\mathbf{Y}$) are uncorrelated with each other. Thus we assume that
the y's within an observation vector (row of $\mathbf{Y}$) are correlated with each other but
independent of the y's in any other observation vector.

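The covariance matrix $\boldsymbol{\Sigma}$ in assumption 2 contains the variances and covariances
of $y_{i1}, y_{i2}, \ldots, y_{ip}$ in any $\mathbf{y}_i$:

\[
\boldsymbol{\Sigma} =
\begin{pmatrix}
\sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\
\sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\
\vdots & \vdots & & \vdots \\
\sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp}
\end{pmatrix}.
\]

The covariance matrix $\mathrm{cov}(\mathbf{y}_i, \mathbf{y}_j) = \mathbf{O}$ in assumption 3 contains the covariances of
each of $y_{i1}, y_{i2}, \ldots, y_{ip}$ with each of $y_{j1}, y_{j2}, \ldots, y_{jp}$:

\[
\begin{pmatrix}
\mathrm{cov}(y_{i1}, y_{j1}) & \mathrm{cov}(y_{i1}, y_{j2}) & \cdots & \mathrm{cov}(y_{i1}, y_{jp}) \\
\mathrm{cov}(y_{i2}, y_{j1}) & \mathrm{cov}(y_{i2}, y_{j2}) & \cdots & \mathrm{cov}(y_{i2}, y_{jp}) \\
\vdots & \vdots & & \vdots \\
\mathrm{cov}(y_{ip}, y_{j1}) & \mathrm{cov}(y_{ip}, y_{j2}) & \cdots & \mathrm{cov}(y_{ip}, y_{jp})
\end{pmatrix}
=
\begin{pmatrix}
0 & 0 & \cdots & 0 \\
0 & 0 & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & 0
\end{pmatrix}.
\]
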

10.4.2  Least  Squares  Estimation  in  the  Multivariate  Model 

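By analogy with the univariate case in (10.5), we estimate $\mathbf{B}$ with

\[
\hat{\mathbf{B}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}.
\tag{10.46}
\]

We call $\hat{\mathbf{B}}$ the least squares estimator for $\mathbf{B}$ because it "minimizes" $\mathbf{E} = \hat{\boldsymbol{\Xi}}'\hat{\boldsymbol{\Xi}}$, a
matrix analogous to SSE:

\[
\mathbf{E} = \hat{\boldsymbol{\Xi}}'\hat{\boldsymbol{\Xi}} = (\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}})'(\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}}).
\]

The matrix $\hat{\mathbf{B}}$ minimizes $\mathbf{E}$ in the following sense. If we let $\mathbf{B}_0$ be an estimate that
may possibly be better than $\hat{\mathbf{B}}$ and add $\mathbf{X}\hat{\mathbf{B}} - \mathbf{X}\mathbf{B}_0$ to $\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}}$, we find that this adds a
positive definite matrix to $\mathbf{E} = (\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}})'(\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}})$ (Rencher 1998, Section 7.4.2).
Thus we cannot improve on $\hat{\mathbf{B}}$. The least squares estimate $\hat{\mathbf{B}}$ also minimizes the
scalar quantities $\mathrm{tr}(\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}})'(\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}})$ and $|(\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}})'(\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}})|$. Note that by
(2.98) $\mathrm{tr}(\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}})'(\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}}) = \sum_{i=1}^{n}\sum_{j=1}^{p} \hat{\varepsilon}_{ij}^2$.

We noted earlier that in the model $\mathbf{Y} = \mathbf{X}\mathbf{B} + \boldsymbol{\Xi}$, there is a column of $\mathbf{B}$ corresponding
to each column of $\mathbf{Y}$; that is, each $y_j$, $j = 1, 2, \ldots, p$, is predicted differently
by $x_1, x_2, \ldots, x_q$. (This is illustrated in Section 10.4.1 for $p = 2$.) In the estimate
$\hat{\mathbf{B}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$, we have a similar pattern. The matrix product $(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$ is multiplied
by each column of $\mathbf{Y}$ [see (2.48)]. Thus the $j$th column of $\hat{\mathbf{B}}$ is the usual least
squares estimate $\hat{\boldsymbol{\beta}}$ for the $j$th dependent variable $y_j$. To give this a more precise
expression, let us denote the $p$ columns of $\mathbf{Y}$ by $\mathbf{y}_{(1)}, \mathbf{y}_{(2)}, \ldots, \mathbf{y}_{(p)}$ to distinguish
them from the $n$ rows $\mathbf{y}_i'$, $i = 1, 2, \ldots, n$. Then

\begin{align}
\hat{\mathbf{B}} &= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'(\mathbf{y}_{(1)}, \mathbf{y}_{(2)}, \ldots, \mathbf{y}_{(p)}) \notag \\
  &= [(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}_{(1)}, (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}_{(2)}, \ldots, (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}_{(p)}] \notag \\
  &= [\hat{\boldsymbol{\beta}}_{(1)}, \hat{\boldsymbol{\beta}}_{(2)}, \ldots, \hat{\boldsymbol{\beta}}_{(p)}]. \tag{10.47}
\end{align}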
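The column-by-column structure of (10.47) and the sense in which $\hat{\mathbf{B}}$ "minimizes" $\mathbf{E}$ can both be illustrated numerically. The sketch below uses simulated data; the dimensions, seed, and the perturbed competitor `B0` are assumptions chosen only for illustration.

```python
import numpy as np

# Simulated data (assumption): n observations, q x's, p y's.
rng = np.random.default_rng(1)
n, q, p = 25, 3, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, q))])
B_true = rng.normal(size=(q + 1, p))
Y = X @ B_true + rng.normal(size=(n, p))

# (10.46): all p regressions at once.
B_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# (10.47): the same estimate built one column of Y at a time.
B_by_column = np.column_stack(
    [np.linalg.solve(X.T @ X, X.T @ Y[:, j]) for j in range(p)]
)

# Error matrix E for B_hat and for an arbitrary competitor B0.
E = (Y - X @ B_hat).T @ (Y - X @ B_hat)
B0 = B_hat + 0.1 * rng.normal(size=B_hat.shape)
E0 = (Y - X @ B0).T @ (Y - X @ B0)

# E0 - E should have no negative eigenvalues, so trace and determinant
# of E0 are at least as large as those of E.
print(np.allclose(B_hat, B_by_column))
print(np.linalg.eigvalsh(E0 - E))
print(np.trace(E), np.trace(E0))
print(np.linalg.det(E), np.linalg.det(E0))
```

The difference `E0 - E` equals $(\mathbf{X}\hat{\mathbf{B}} - \mathbf{X}\mathbf{B}_0)'(\mathbf{X}\hat{\mathbf{B}} - \mathbf{X}\mathbf{B}_0)$ because the cross-product terms vanish, which is why its eigenvalues are nonnegative.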

Example  10.4.2.  The  results  of  a  planned  experiment  involving  a  chemical  reaction 
are  given  in  Table  10.1  (Box  and  Youle  1955). 

The input (independent) variables are

    x1 = temperature,  x2 = concentration,  x3 = time.

Table 10.1. Chemical Reaction Data

Experiment      Yield Variables          Input Variables
Number          y1      y2      y3       x1      x2      x3
 1             41.5    45.9    11.2     162     23       3
 2             33.8    53.3    11.2     162     23       8
 3             27.7    57.5    12.7     162     30       5
 4             21.7    58.8    16.0     162     30       8
 5             19.9    60.6    16.2     172     25       5
 6             15.0    58.0    22.6     172     25       8
 7             12.2    58.6    24.5     172     30       5
 8              4.3    52.4    38.0     172     30       8
 9             19.3    56.9    21.3     167     27.5     6.5
10              6.4    55.4    30.8     177     27.5     6.5
11             37.6    46.9    14.7     157     27.5     6.5
12             18.0    57.3    22.2     167     32.5     6.5
13             26.3    55.0    18.3     167     22.5     6.5
14              9.9    58.9    28.0     167     27.5     9.5
15             25.0    50.3    22.1     167     27.5     3.5
16             14.1    61.1    23.0     177     20       6.5
17             15.2    62.9    20.7     177     20       6.5
18             15.9    60.0    22.1     160     34       7.5
19             19.6    60.6    19.3     160     34       7.5

The yield (dependent) variables are

    y1 = percentage of unchanged starting material,
    y2 = percentage converted to the desired product,
    y3 = percentage of unwanted by-product.


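Using (10.46), the least squares estimator $\hat{\mathbf{B}}$ for the regression of $(y_1, y_2, y_3)$ on
$(x_1, x_2, x_3)$ is given by

\[
\hat{\mathbf{B}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} =
\begin{pmatrix}
332.11 & -26.04 & -164.08 \\
-1.55 & .40 & .91 \\
-1.42 & .29 & .90 \\
-2.24 & 1.03 & 1.15
\end{pmatrix}.
\]

Note that the first column of $\hat{\mathbf{B}}$ gives $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3$ for regression of $y_1$ on $x_1, x_2, x_3$;
the second column of $\hat{\mathbf{B}}$ gives $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3$ for regression of $y_2$ on $x_1, x_2, x_3$, and so
on. □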


10.4.3 Properties of Least Squares Estimators $\hat{\mathbf{B}}$

The least squares estimator $\hat{\mathbf{B}}$ can be obtained without imposing the assumptions
$E(\mathbf{Y}) = \mathbf{X}\mathbf{B}$, $\mathrm{cov}(\mathbf{y}_i) = \boldsymbol{\Sigma}$, and $\mathrm{cov}(\mathbf{y}_i, \mathbf{y}_j) = \mathbf{O}$. However, when these assumptions
hold, $\hat{\mathbf{B}}$ has the following properties:

1. The estimator $\hat{\mathbf{B}}$ is unbiased, that is, $E(\hat{\mathbf{B}}) = \mathbf{B}$. This means that if we took
repeated random samples from the same population, the average value of $\hat{\mathbf{B}}$
would be $\mathbf{B}$.

2. The least squares estimators $\hat{\beta}_{jk}$ in $\hat{\mathbf{B}}$ have minimum variance among all possible
linear unbiased estimators. This result is known as the Gauss-Markov
theorem. The restriction to unbiased estimators is necessary to exclude trivial
estimators such as a constant, which has variance equal to zero, but is of no
interest. This minimum variance property of least squares estimators is remarkable
for its distributional generality; normality of the y's is not required.

3. All $\hat{\beta}_{jk}$'s in $\hat{\mathbf{B}}$ are correlated with each other. This is due to the correlations
among the x's and among the y's. The $\hat{\beta}$'s within a given column of $\hat{\mathbf{B}}$ are correlated
because $x_1, x_2, \ldots, x_q$ are correlated. If $x_1, x_2, \ldots, x_q$ were orthogonal
to each other, the $\hat{\beta}$'s within each column of $\hat{\mathbf{B}}$ would be uncorrelated.
Thus the relationship of the x's to each other affects the relationship of the $\hat{\beta}$'s
within each column to each other. On the other hand, the $\hat{\beta}$'s in each column
are correlated with $\hat{\beta}$'s in other columns because $y_1, y_2, \ldots, y_p$ are correlated.

Because of the correlations among the columns of $\hat{\mathbf{B}}$, we need multivariate
tests for hypotheses about $\mathbf{B}$. We cannot use an F-test from Section 10.2.5
on each column of $\mathbf{B}$, because these F-tests would not take into account the
correlations or preserve the $\alpha$-level. Some appropriate multivariate tests are
given in Section 10.5.


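10.4.4 An Estimator for $\boldsymbol{\Sigma}$

By analogy with (10.8) and (10.9), an unbiased estimator of $\mathrm{cov}(\mathbf{y}_i) = \boldsymbol{\Sigma}$ is given by

\begin{align}
\mathbf{S}_e = \frac{\mathbf{E}}{n - q - 1}
  &= \frac{(\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}})'(\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}})}{n - q - 1} \tag{10.48} \\
  &= \frac{\mathbf{Y}'\mathbf{Y} - \hat{\mathbf{B}}'\mathbf{X}'\mathbf{Y}}{n - q - 1}. \tag{10.49}
\end{align}

With the denominator $n - q - 1$, $\mathbf{S}_e$ is an unbiased estimator of $\boldsymbol{\Sigma}$; that is, $E(\mathbf{S}_e) = \boldsymbol{\Sigma}$.

10.4.5 Model Corrected for Means

If the x's are centered by subtracting their means, we have the centered $\mathbf{X}$ matrix as
in (10.13),

\[
\mathbf{X}_c =
\begin{pmatrix}
x_{11} - \bar{x}_1 & x_{12} - \bar{x}_2 & \cdots & x_{1q} - \bar{x}_q \\
x_{21} - \bar{x}_1 & x_{22} - \bar{x}_2 & \cdots & x_{2q} - \bar{x}_q \\
\vdots & \vdots & & \vdots \\
x_{n1} - \bar{x}_1 & x_{n2} - \bar{x}_2 & \cdots & x_{nq} - \bar{x}_q
\end{pmatrix}.
\]

The $\mathbf{B}$ matrix can be partitioned as

\[
\mathbf{B} =
\begin{pmatrix}
\boldsymbol{\beta}_0' \\ \mathbf{B}_1
\end{pmatrix}
=
\begin{pmatrix}
\beta_{01} & \beta_{02} & \cdots & \beta_{0p} \\
\beta_{11} & \beta_{12} & \cdots & \beta_{1p} \\
\vdots & \vdots & & \vdots \\
\beta_{q1} & \beta_{q2} & \cdots & \beta_{qp}
\end{pmatrix}.
\]

By analogy with (10.14) and (10.15), the estimates are

\begin{align}
\hat{\mathbf{B}}_1 &= (\mathbf{X}_c'\mathbf{X}_c)^{-1}\mathbf{X}_c'\mathbf{Y}, \tag{10.50} \\
\hat{\boldsymbol{\beta}}_0' &= \bar{\mathbf{y}}' - \bar{\mathbf{x}}'\hat{\mathbf{B}}_1, \tag{10.51}
\end{align}

where $\bar{\mathbf{y}}' = (\bar{y}_1, \bar{y}_2, \ldots, \bar{y}_p)$ and $\bar{\mathbf{x}}' = (\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_q)$. These estimates give the
same results as $\hat{\mathbf{B}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$ in (10.46).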
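As in (10.20), the estimate $\hat{\mathbf{B}}_1$ in (10.50) can be expressed in terms of sample
covariance matrices. We multiply and divide (10.50) by $n - 1$ to obtain

\[
\hat{\mathbf{B}}_1 = (n - 1)(\mathbf{X}_c'\mathbf{X}_c)^{-1}\frac{\mathbf{X}_c'\mathbf{Y}}{n - 1}
  = \left(\frac{\mathbf{X}_c'\mathbf{X}_c}{n - 1}\right)^{-1}\frac{\mathbf{X}_c'\mathbf{Y}}{n - 1}
  = \mathbf{S}_{xx}^{-1}\mathbf{S}_{xy},
\tag{10.52}
\]

where $\mathbf{S}_{xx}$ and $\mathbf{S}_{xy}$ are blocks from the overall sample covariance matrix of $y_1,
y_2, \ldots, y_p, x_1, x_2, \ldots, x_q$:

\[
\mathbf{S} =
\begin{pmatrix}
\mathbf{S}_{yy} & \mathbf{S}_{yx} \\
\mathbf{S}_{xy} & \mathbf{S}_{xx}
\end{pmatrix}.
\tag{10.53}
\]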
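The agreement among the uncentered estimate (10.46), the centered estimates (10.50) and (10.51), and the covariance form (10.52) can be verified on any data set. The following sketch uses simulated data (the dimensions and seed are assumptions) and also computes $\mathbf{S}_e$ in both forms (10.48) and (10.49).

```python
import numpy as np

# Simulated data (assumption): n observations, q x's, p y's.
rng = np.random.default_rng(3)
n, q, p = 20, 3, 2
x = rng.normal(loc=5.0, size=(n, q))
Y = 1.0 + x @ rng.normal(size=(q, p)) + rng.normal(size=(n, p))
X = np.column_stack([np.ones(n), x])

# Uncentered estimate (10.46).
B_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Centered estimates (10.50) and (10.51).
Xc = x - x.mean(axis=0)
B1_hat = np.linalg.solve(Xc.T @ Xc, Xc.T @ Y)
beta0_hat = Y.mean(axis=0) - x.mean(axis=0) @ B1_hat

# Covariance form (10.52), with S partitioned as in (10.53):
# the y's occupy the first p rows/columns, the x's the last q.
S = np.cov(np.hstack([Y, x]), rowvar=False)
S_xx, S_xy = S[p:, p:], S[p:, :p]
B1_from_S = np.linalg.solve(S_xx, S_xy)

# Unbiased estimator of Sigma in the forms (10.48) and (10.49).
resid = Y - X @ B_hat
Se_resid = resid.T @ resid / (n - q - 1)
Se_short = (Y.T @ Y - B_hat.T @ X.T @ Y) / (n - q - 1)

print(np.allclose(B1_hat, B_hat[1:]), np.allclose(beta0_hat, B_hat[0]))
print(np.allclose(B1_from_S, B1_hat), np.allclose(Se_resid, Se_short))
```

Note that `np.cov` uses the divisor $n - 1$ by default, which matches the definition of the sample covariance blocks used here; the divisor cancels in $\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}$ in any case.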


10.5  MULTIVARIATE  MULTIPLE  REGRESSION:  HYPOTHESIS  TESTS 

In this section we extend the two tests of Section 10.2.5 to the multivariate y case.
We assume the x's are fixed and the y's are multivariate normal. For other tests and
confidence intervals, see Rencher (1998, Chapter 7).


10.5.1  Test  of  Overall  Regression 

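We first consider the hypothesis that none of the x's predict any of the y's, which can
be expressed as $H_0\colon \mathbf{B}_1 = \mathbf{O}$, where $\mathbf{B}_1$ includes all rows of $\mathbf{B}$ except the first:

\[
\mathbf{B} =
\begin{pmatrix}
\boldsymbol{\beta}_0' \\ \mathbf{B}_1
\end{pmatrix}
=
\begin{pmatrix}
\beta_{01} & \beta_{02} & \cdots & \beta_{0p} \\
\beta_{11} & \beta_{12} & \cdots & \beta_{1p} \\
\vdots & \vdots & & \vdots \\
\beta_{q1} & \beta_{q2} & \cdots & \beta_{qp}
\end{pmatrix}.
\]

We do not wish to include $\boldsymbol{\beta}_0' = \mathbf{0}'$ in the hypothesis, because this would restrict
all y's to have intercepts of zero. The alternative hypothesis is $H_1\colon \mathbf{B}_1 \neq \mathbf{O}$, which
implies that we want to know if even one $\beta_{jk} \neq 0$, $j = 1, 2, \ldots, q$; $k = 1, 2, \ldots, p$.

The numerator of (10.49) suggests a partitioning of the total sum of squares and
products matrix $\mathbf{Y}'\mathbf{Y}$,

\[
\mathbf{Y}'\mathbf{Y} = (\mathbf{Y}'\mathbf{Y} - \hat{\mathbf{B}}'\mathbf{X}'\mathbf{Y}) + \hat{\mathbf{B}}'\mathbf{X}'\mathbf{Y}.
\]

By analogy to (10.23), we subtract $n\bar{\mathbf{y}}\bar{\mathbf{y}}'$ from both sides to avoid inclusion of
$\boldsymbol{\beta}_0' = \mathbf{0}'$:

\begin{align}
\mathbf{Y}'\mathbf{Y} - n\bar{\mathbf{y}}\bar{\mathbf{y}}'
  &= (\mathbf{Y}'\mathbf{Y} - \hat{\mathbf{B}}'\mathbf{X}'\mathbf{Y}) + (\hat{\mathbf{B}}'\mathbf{X}'\mathbf{Y} - n\bar{\mathbf{y}}\bar{\mathbf{y}}') \notag \\
  &= \mathbf{E} + \mathbf{H}. \tag{10.54}
\end{align}
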
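The overall regression sum of squares and products matrix $\mathbf{H} = \hat{\mathbf{B}}'\mathbf{X}'\mathbf{Y} - n\bar{\mathbf{y}}\bar{\mathbf{y}}'$ can
be used to test $H_0\colon \mathbf{B}_1 = \mathbf{O}$. The notation $\mathbf{E} = \mathbf{Y}'\mathbf{Y} - \hat{\mathbf{B}}'\mathbf{X}'\mathbf{Y}$ and $\mathbf{H} = \hat{\mathbf{B}}'\mathbf{X}'\mathbf{Y} - n\bar{\mathbf{y}}\bar{\mathbf{y}}'$
conforms with usage of $\mathbf{E}$ and $\mathbf{H}$ in Chapter 6.

As in Chapter 6, we can test $H_0\colon \mathbf{B}_1 = \mathbf{O}$ by means of

\[
\Lambda = \frac{|\mathbf{E}|}{|\mathbf{E} + \mathbf{H}|}
  = \frac{|\mathbf{Y}'\mathbf{Y} - \hat{\mathbf{B}}'\mathbf{X}'\mathbf{Y}|}{|\mathbf{Y}'\mathbf{Y} - n\bar{\mathbf{y}}\bar{\mathbf{y}}'|},
\tag{10.55}
\]

which is distributed as $\Lambda_{p,q,n-q-1}$ when $H_0\colon \mathbf{B}_1 = \mathbf{O}$ is true, where $p$ is the number
of y's and $q$ is the number of x's. We reject $H_0$ if $\Lambda \le \Lambda_{\alpha,p,q,n-q-1}$. The likelihood
ratio approach leads to the same test statistic. If $\mathbf{H}$ is "large" due to large values
of the $\hat{\beta}_{jk}$'s, then $|\mathbf{E} + \mathbf{H}|$ would be expected to be sufficiently greater than $|\mathbf{E}|$ so
that $\Lambda$ would lead to rejection. By $\mathbf{H}$ large, we mean that the regression sums of
squares on the diagonal are large. Critical values for $\Lambda$ are available in Table A.9
using $\nu_H = q$ and $\nu_E = n - q - 1$. Note that these degrees of freedom are the same
as in the univariate test for regression of y on $x_1, x_2, \ldots, x_q$ in (10.24). The F and
$\chi^2$ approximations for $\Lambda$ in (6.15) and (6.16) can also be used.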

There are two alternative expressions for Wilks' $\Lambda$ in (10.55). We can express $\Lambda$
in terms of the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_s$ of $\mathbf{E}^{-1}\mathbf{H}$:

\[
\Lambda = \prod_{i=1}^{s} \frac{1}{1 + \lambda_i},
\tag{10.56}
\]

where $s = \min(p, q)$. Wilks' $\Lambda$ can also be written in the form

\[
\Lambda = \frac{|\mathbf{S}|}{|\mathbf{S}_{xx}||\mathbf{S}_{yy}|},
\tag{10.57}
\]

where $\mathbf{S}$ is partitioned as in (10.53):

\[
\mathbf{S} =
\begin{pmatrix}
\mathbf{S}_{yy} & \mathbf{S}_{yx} \\
\mathbf{S}_{xy} & \mathbf{S}_{xx}
\end{pmatrix}.
\]

The form of $\Lambda$ in (10.57) is the same as in the test for independence of $\mathbf{y}$ and $\mathbf{x}$ given
in (7.30), where $\mathbf{y}$ and $\mathbf{x}$ are both random vectors. In the present section, the y's are
random variables and the x's are fixed. Thus $\mathbf{S}_{yy}$ is the sample covariance matrix
of the y's in the usual sense, whereas $\mathbf{S}_{xx}$ consists of an analogous mathematical
expression involving the constant x's (see comments about $\mathbf{S}_{xx}$ in Section 10.2.4).

By the symmetry of (10.57) in $\mathbf{x}$ and $\mathbf{y}$, $\Lambda$ is distributed as $\Lambda_{q,p,n-p-1}$ as well as
$\Lambda_{p,q,n-q-1}$. This is equivalent to property 3 in Section 6.1.3. Hence, if we regressed
the x's on the y's, we would get a different $\hat{\mathbf{B}}$ but would have the same value of $\Lambda$
for the test.

The union-intersection test of $H_0\colon \mathbf{B}_1 = \mathbf{O}$ uses Roy's test statistic analogous to
(6.20),

\[
\theta = \frac{\lambda_1}{1 + \lambda_1},
\tag{10.58}
\]

where $\lambda_1$ is the largest eigenvalue of $\mathbf{E}^{-1}\mathbf{H}$. Upper percentage points $\theta_\alpha$ are given in
Table A.10. The accompanying parameters are

\[
s = \min(p, q), \qquad m = \tfrac{1}{2}(|q - p| - 1), \qquad N = \tfrac{1}{2}(n - q - p - 2).
\]

The hypothesis is rejected if $\theta > \theta_\alpha$.

As in Section 6.1.5, Pillai's test statistic is defined as

\[
V^{(s)} = \sum_{i=1}^{s} \frac{\lambda_i}{1 + \lambda_i},
\tag{10.59}
\]

and the Lawley-Hotelling test statistic is given by

\[
U^{(s)} = \sum_{i=1}^{s} \lambda_i,
\tag{10.60}
\]

where $\lambda_1, \lambda_2, \ldots, \lambda_s$ are the eigenvalues of $\mathbf{E}^{-1}\mathbf{H}$. For $V^{(s)}$, upper percentage points
are found in Table A.11, indexed by $s$, $m$, and $N$ as defined earlier in connection with
Roy's test. Upper percentage points for $\nu_E U^{(s)}/\nu_H$ (see Section 6.1.5) are provided
in Table A.12, where $\nu_H = q$ and $\nu_E = n - q - 1$. Alternatively, we can use the
F-approximations for $V^{(s)}$ and $U^{(s)}$ in Section 6.1.5.

When $H_0$ is true, all four test statistics have probability $\alpha$ of rejecting; that is, they
all have the same probability of a Type I error. When $H_0$ is false, the power ranking
of the tests depends on the configuration of the population eigenvalues, as was noted
in Section 6.2. (The sample eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_s$ from $\mathbf{E}^{-1}\mathbf{H}$ are estimates of
the population eigenvalues.) If the population eigenvalues are equal or nearly equal,
the power ranking of the tests is $V^{(s)} \ge \Lambda \ge U^{(s)} \ge \theta$. If only one population
eigenvalue is nonzero, the powers are reversed: $\theta \ge U^{(s)} \ge \Lambda \ge V^{(s)}$.

In the case of a single nonzero population eigenvalue, the rank of $\mathbf{B}_1$ is 1. There
are various ways this could occur; for example, $\mathbf{B}_1$ could have only one nonzero row,
which would indicate that only one of the x's predicts the y's. On the other hand,
a single nonzero column implies that only one of the y's is predicted by the x's.
Alternatively, $\mathbf{B}_1$ would have rank 1 if all rows were equal or linear combinations
of each other, manifesting that all x's act alike in predicting the y's. Similarly, all
columns equal to each other or linear functions of each other would signify only one
dimension in the y's as they relate to the x's.


Example  10.5.1.  For  the  chemical  data  of  Table  10.1,  we  test  the  overall  regression 
hypothesis  Hq  :  B 1  =  O.  The  E  and  H  matrices  are  given  by 
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/  80.174  -21.704  -65.923  \ 

E  =  I  -21.704  249.462  -179.496  , 

\  -65.923  -179.496  231.197  ) 

/  1707.158  -492.532  -996.584  \ 

H=  -492.532  151.002  283.607 

\  -996.584  283.607  583.688  j 


The  eigenvalues  of  E  H  are  26.3183,  .1004,  and  .0033.  The  parameters  for  use  in 
obtaining  critical  values  of  the  four  test  statistics  are 


vh  —  q  —  3,  ve  =  n  —  q  —  1  =  19  —  3— 1  =  15, 
5  =  min(3,  3)  =  3,  m  =  ^{\q  —  p\  —  1)  =  — 

N  =  \{n  —  q  —  p  —  2)  =  5.5. 


Using the eigenvalues, we obtain the test statistics

\[
\begin{aligned}
\Lambda &= \prod_{i=1}^{3} \frac{1}{1 + \lambda_i} = \frac{1}{1 + 26.3183} \cdot \frac{1}{1 + .1004} \cdot \frac{1}{1 + .0033} = .0332 < \Lambda_{.05,3,3,15} = .309, \\
\theta &= \frac{\lambda_1}{1 + \lambda_1} = .963 > \theta_{.05,3,0,5} = .669, \\
V^{(s)} &= \sum_{i=1}^{3} \frac{\lambda_i}{1 + \lambda_i} = 1.058 > V^{(s)}_{.05,3,0,5} = 1.040, \\
U^{(s)} &= \sum_{i=1}^{3} \lambda_i = 26.422, \qquad \frac{\nu_E U^{(s)}}{\nu_H} = 132.11,
\end{aligned}
\]

which exceeds the .05 critical value, 8.936 (interpolated), from Table A.11 (see Section 6.1.5). Thus all four tests reject $H_0$. Note that the critical values given for $\theta$ and $V^{(s)}$ are conservative, since 0 was used in place of $-.5$ for $m$.

In this case, the first eigenvalue, 26.3183, completely dominates the other two. In Example 10.4.2, we obtained

\[
\hat{\mathbf{B}}_1 = \begin{pmatrix} -1.55 & .40 & .91 \\ -1.42 & .29 & .90 \\ -2.24 & 1.03 & 1.15 \end{pmatrix}.
\]

The columns are approximately proportional to each other, indicating that there is essentially only one dimension in the y’s as they are predicted by the x’s. A similar statement can be made about the rows and the dimensionality of the x’s as they predict the y’s. □
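The arithmetic in Example 10.5.1 can be retraced with a short NumPy sketch. The function names `eigenvalues_of_EinvH` and `four_test_statistics` are hypothetical helpers introduced only for illustration; the eigenvalues and degrees of freedom are the ones quoted in the example.

```python
import numpy as np

def eigenvalues_of_EinvH(E, H):
    """Eigenvalues of E^{-1}H, sorted from largest to smallest."""
    vals = np.linalg.eigvals(np.linalg.solve(E, H))
    return np.sort(vals.real)[::-1]

def four_test_statistics(eigvals):
    """Wilks, Roy, Pillai, and Lawley-Hotelling statistics from the eigenvalues."""
    lam = np.asarray(eigvals, dtype=float)
    return {
        "wilks": float(np.prod(1.0 / (1.0 + lam))),      # Lambda
        "roy": float(lam.max() / (1.0 + lam.max())),     # theta
        "pillai": float(np.sum(lam / (1.0 + lam))),      # V^(s)
        "lawley_hotelling": float(lam.sum()),            # U^(s)
    }

# Eigenvalues of E^{-1}H quoted in Example 10.5.1
stats = four_test_statistics([26.3183, 0.1004, 0.0033])

# Scaled Lawley-Hotelling statistic nu_E * U^(s) / nu_H with nu_E = 15, nu_H = 3
nu_E, nu_H = 15, 3
scaled_U = nu_E * stats["lawley_hotelling"] / nu_H

for name, value in stats.items():
    print(f"{name}: {value:.4f}")
print(f"scaled U: {scaled_U:.2f}")
```

Given the matrices $\mathbf{E}$ and $\mathbf{H}$ as arrays, `eigenvalues_of_EinvH(E, H)` would supply the input to `four_test_statistics`; small differences from the printed eigenvalues are possible because the matrices are rounded.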


10.5.2  Test  on  a  Subset  of  the  x’s 

We consider the hypothesis that the y’s do not depend on the last $h$ of the x’s, $x_{q-h+1}, x_{q-h+2}, \ldots, x_q$. By this we mean that none of the $p$ y’s is predicted by any of these $h$ x’s. To express this hypothesis, write the $\mathbf{B}$ matrix in partitioned form,

\[
\mathbf{B} = \begin{pmatrix} \mathbf{B}_r \\ \mathbf{B}_d \end{pmatrix},
\]

where, as in Section 10.2.5b, the subscript $r$ denotes the subset of $\beta_{jk}$’s to be retained in the reduced model and $d$ represents the subset of $\beta_{jk}$’s to be deleted if they are not significant predictors of the y’s. Thus $\mathbf{B}_d$ has $h$ rows. The hypothesis can be expressed as

\[
H_0\colon \mathbf{B}_d = \mathbf{O}.
\]

If $\mathbf{X}_r$ contains the columns of $\mathbf{X}$ corresponding to $\mathbf{B}_r$, then the reduced model is

\[
\mathbf{Y} = \mathbf{X}_r\mathbf{B}_r + \boldsymbol{\Xi}. \tag{10.61}
\]

To compare the fit of the full model and the reduced model, we use the difference between the regression sum of squares and products matrix for the full model, $\hat{\mathbf{B}}'\mathbf{X}'\mathbf{Y}$, and the regression sum of squares and products matrix for the reduced model, $\hat{\mathbf{B}}_r'\mathbf{X}_r'\mathbf{Y}$. By analogy to (10.26), this difference becomes our $\mathbf{H}$ matrix:

\[
\mathbf{H} = \hat{\mathbf{B}}'\mathbf{X}'\mathbf{Y} - \hat{\mathbf{B}}_r'\mathbf{X}_r'\mathbf{Y}. \tag{10.62}
\]

Thus the test of $H_0\colon \mathbf{B}_d = \mathbf{O}$ is a full and reduced model test of the significance of $x_{q-h+1}, x_{q-h+2}, \ldots, x_q$ above and beyond $x_1, x_2, \ldots, x_{q-h}$.

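To make the test, we use the $\mathbf{E}$ matrix based on the full model, $\mathbf{E} = \mathbf{Y}'\mathbf{Y} - \hat{\mathbf{B}}'\mathbf{X}'\mathbf{Y}$. Then

\[
\begin{aligned}
\mathbf{E} + \mathbf{H} &= (\mathbf{Y}'\mathbf{Y} - \hat{\mathbf{B}}'\mathbf{X}'\mathbf{Y}) + (\hat{\mathbf{B}}'\mathbf{X}'\mathbf{Y} - \hat{\mathbf{B}}_r'\mathbf{X}_r'\mathbf{Y}) \\
&= \mathbf{Y}'\mathbf{Y} - \hat{\mathbf{B}}_r'\mathbf{X}_r'\mathbf{Y},
\end{aligned}
\]

and Wilks’ $\Lambda$-statistic is given by

\[
\Lambda(x_{q-h+1}, \ldots, x_q \mid x_1, \ldots, x_{q-h}) = \frac{|\mathbf{E}|}{|\mathbf{E} + \mathbf{H}|} = \frac{|\mathbf{Y}'\mathbf{Y} - \hat{\mathbf{B}}'\mathbf{X}'\mathbf{Y}|}{|\mathbf{Y}'\mathbf{Y} - \hat{\mathbf{B}}_r'\mathbf{X}_r'\mathbf{Y}|}, \tag{10.63}
\]

which is distributed as $\Lambda_{p,h,n-q-1}$ when $H_0\colon \mathbf{B}_d = \mathbf{O}$ is true. Critical values are available in Table A.9 with $\nu_H = h$ and $\nu_E = n - q - 1$. Note that these degrees of freedom for the multivariate y case are the same as for the univariate y case (multiple regression) in Section 10.2.5b. The $F$- and $\chi^2$-approximations for $\Lambda$ in (6.15) and (6.16) can also be used.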

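As implied in the notation $\Lambda(x_{q-h+1}, \ldots, x_q \mid x_1, \ldots, x_{q-h})$, Wilks’ $\Lambda$ in (10.63) provides a full and reduced model test. We can express it in terms of $\Lambda$ for the full model and a similar $\Lambda$ for the reduced model. In the denominator of (10.63), we have $\mathbf{Y}'\mathbf{Y} - \hat{\mathbf{B}}_r'\mathbf{X}_r'\mathbf{Y}$, which is the error matrix for the reduced model $\mathbf{Y} = \mathbf{X}_r\mathbf{B}_r + \boldsymbol{\Xi}$ in (10.61). This error matrix could be used in a test for the significance of overall regression in the reduced model, as in (10.55),

\[
\Lambda_r = \frac{|\mathbf{Y}'\mathbf{Y} - \hat{\mathbf{B}}_r'\mathbf{X}_r'\mathbf{Y}|}{|\mathbf{Y}'\mathbf{Y} - n\bar{\mathbf{y}}\bar{\mathbf{y}}'|}. \tag{10.64}
\]

Since $\Lambda_r$ in (10.64) has the same denominator as $\Lambda$ in (10.55), we recognize (10.63) as the ratio of Wilks’ $\Lambda$ for the overall regression test in the full model to Wilks’ $\Lambda$ for the overall regression test in the reduced model,

\[
\Lambda(x_{q-h+1}, \ldots, x_q \mid x_1, \ldots, x_{q-h}) = \frac{|\mathbf{Y}'\mathbf{Y} - \hat{\mathbf{B}}'\mathbf{X}'\mathbf{Y}|}{|\mathbf{Y}'\mathbf{Y} - \hat{\mathbf{B}}_r'\mathbf{X}_r'\mathbf{Y}|} = \frac{|\mathbf{Y}'\mathbf{Y} - \hat{\mathbf{B}}'\mathbf{X}'\mathbf{Y}| \,/\, |\mathbf{Y}'\mathbf{Y} - n\bar{\mathbf{y}}\bar{\mathbf{y}}'|}{|\mathbf{Y}'\mathbf{Y} - \hat{\mathbf{B}}_r'\mathbf{X}_r'\mathbf{Y}| \,/\, |\mathbf{Y}'\mathbf{Y} - n\bar{\mathbf{y}}\bar{\mathbf{y}}'|} = \frac{\Lambda_f}{\Lambda_r}, \tag{10.65}
\]

where $\Lambda_f$ is given by (10.55). In (10.65), we have a convenient computational device. We run the overall regression test for the full model and again for the reduced model and take the ratio of the resulting $\Lambda$ values.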
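The computational device in (10.65) can be sketched in a few lines of NumPy. This is a minimal illustration on simulated data, not the chemical data; the helper names `error_matrix`, `overall_wilks`, and `partial_wilks` are hypothetical, and an intercept column is assumed to be added to every model.

```python
import numpy as np

def error_matrix(Y, X):
    """E = Y'Y - B_hat'X'Y for a model with an intercept column prepended to X."""
    n = Y.shape[0]
    X1 = np.column_stack([np.ones(n), X])
    B_hat, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    return Y.T @ Y - B_hat.T @ X1.T @ Y

def overall_wilks(Y, X):
    """Overall-regression Wilks' Lambda: |E| / |Y'Y - n ybar ybar'|."""
    n = Y.shape[0]
    ybar = Y.mean(axis=0)
    total = Y.T @ Y - n * np.outer(ybar, ybar)
    return np.linalg.det(error_matrix(Y, X)) / np.linalg.det(total)

def partial_wilks(Y, X, h):
    """Lambda for the last h columns of X adjusted for the others: Lambda_f / Lambda_r."""
    return overall_wilks(Y, X) / overall_wilks(Y, X[:, :-h])

# Simulated data (assumption): only the first two x's carry signal
rng = np.random.default_rng(0)
n, q, p, h = 40, 4, 2, 2
X = rng.normal(size=(n, q))
Y = X[:, :2] @ rng.normal(size=(2, p)) + rng.normal(size=(n, p))

lam_partial = partial_wilks(Y, X, h)
# Direct form (10.63): |E_full| / |E_reduced|
direct = np.linalg.det(error_matrix(Y, X)) / np.linalg.det(error_matrix(Y, X[:, :-h]))
print(lam_partial, direct)
```

The two printed values agree because the common denominator $|\mathbf{Y}'\mathbf{Y} - n\bar{\mathbf{y}}\bar{\mathbf{y}}'|$ cancels in the ratio.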

The Wilks’ $\Lambda$ in (10.65), comparing the full and reduced models, is similar in appearance to (6.127). However, in (6.127), the full and reduced models involve the dependent variables $y_1, y_2, \ldots, y_p$ in MANOVA, whereas in (10.65), the reduced model is obtained by deleting a subset of the independent variables $x_1, x_2, \ldots, x_q$ in regression. The parameters of the Wilks’ $\Lambda$ distribution are different in the two cases. Note that in (6.127), some of the dependent variables were denoted by $x_1, \ldots, x_q$ for convenience.

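Test statistics due to Roy, Pillai, and Lawley-Hotelling can be obtained from the eigenvalues of $\mathbf{E}^{-1}\mathbf{H} = (\mathbf{Y}'\mathbf{Y} - \hat{\mathbf{B}}'\mathbf{X}'\mathbf{Y})^{-1}(\hat{\mathbf{B}}'\mathbf{X}'\mathbf{Y} - \hat{\mathbf{B}}_r'\mathbf{X}_r'\mathbf{Y})$. Critical values or approximate tests for these three test statistics are based on $\nu_H = h$, $\nu_E = n - q - 1$, and

\[
s = \min(p, h), \qquad m = \tfrac{1}{2}(|h - p| - 1), \qquad N = \tfrac{1}{2}(n - p - h - 2).
\]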


Example 10.5.2. The chemical data in Table 10.1 originated from a response surface experiment seeking to locate optimum operating conditions. Therefore, a second-order model is of interest, and we add $x_1^2, x_2^2, x_3^2, x_1x_2, x_1x_3, x_2x_3$ to the variables $x_1, x_2, x_3$ considered in Example 10.5.1. There are now $q = 9$ independent variables, and we obtain an overall Wilks’ $\Lambda$ of

\[
\Lambda = .00145 < \Lambda_{.05,3,9,9} = .024,
\]

where $\nu_H = q = 9$ and $\nu_E = n - q - 1 = 19 - 9 - 1 = 9$. To see whether the six second-order variables add a significant amount to $x_1, x_2, x_3$ for predicting the y’s, we calculate

\[
\Lambda = \frac{\Lambda_f}{\Lambda_r} = \frac{.00145}{.0332} = .0438 < \Lambda_{.05,3,6,9} = .049,
\]

where $\nu_H = h = 6$ and $\nu_E = n - q - 1 = 19 - 9 - 1 = 9$. In this case, $\Lambda_r = .0332$ is from Example 10.5.1, in which we considered the model with $x_1$, $x_2$, and $x_3$. Thus we reject $H_0\colon \mathbf{B}_d = \mathbf{O}$ and conclude that the second-order terms add significant predictability to the model. □


10.6 MEASURES OF ASSOCIATION BETWEEN THE y’s AND THE x’s


The  most  widely  used  measures  of  association  between  two  sets  of  variables  are  the 
canonical  correlations,  which  are  treated  in  Chapter  11.  In  this  section,  we  review 
other  measures  of  association  that  have  been  proposed. 

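In (10.34), we have $R^2 = \mathbf{s}_{yx}'\mathbf{S}_{xx}^{-1}\mathbf{s}_{yx}/s_{yy}$ for the univariate y case. By analogy, we define an $R^2$-like measure of association between $y_1, y_2, \ldots, y_p$ and $x_1, x_2, \ldots, x_q$ as

\[
R_M^2 = \frac{|\mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}|}{|\mathbf{S}_{yy}|}, \tag{10.66}
\]

where $\mathbf{S}_{yx}$, $\mathbf{S}_{xx}$, $\mathbf{S}_{xy}$, and $\mathbf{S}_{yy}$ are defined in (10.53) and the subscript $M$ indicates multivariate y’s.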

By analogy to $r_{xy} = s_{xy}/s_xs_y$ in (3.13), Robert and Escoufier (1976) suggested

\[
RV = \frac{\operatorname{tr}(\mathbf{S}_{xy}\mathbf{S}_{yx})}{\sqrt{\operatorname{tr}(\mathbf{S}_{xx}^2)\operatorname{tr}(\mathbf{S}_{yy}^2)}}. \tag{10.67}
\]

Kabe (1985) discussed the generalized correlation determinant

\[
\frac{|\mathbf{L}'\mathbf{S}_{xy}\mathbf{M}\mathbf{M}'\mathbf{S}_{yx}\mathbf{L}|}{|\mathbf{L}'\mathbf{S}_{xx}\mathbf{L}|\,|\mathbf{M}'\mathbf{S}_{yy}\mathbf{M}|}
\]

for various choices of the transformation matrices $\mathbf{L}$ and $\mathbf{M}$.

In Section 6.1.8, we introduced several measures of association that quantify the amount of relationship between the y’s and the dummy grouping variables in a MANOVA context. These are even more appropriate here in the multivariate regression setting, where both the x’s and the y’s are continuous variables. The $R^2$-like indices given in (6.41), (6.43), (6.46), (6.49), and (6.51) range between 0 and 1 and will be briefly reviewed in the remainder of this section. For more complete commentary, see Section 6.1.8.

The two measures based on Wilks’ $\Lambda$ are

\[
\eta_\Lambda^2 = 1 - \Lambda,
\]
\[
A_\Lambda = 1 - \Lambda^{1/s},
\]

where $s = \min(p, q)$. A measure based on Roy’s $\theta$ is provided by $\theta$ itself,

\[
\eta_\theta^2 = \frac{\lambda_1}{1 + \lambda_1} = \theta,
\]

where $\lambda_1$ is the largest eigenvalue of $\mathbf{E}^{-1}\mathbf{H}$. This was identified in Section 6.1.8 as the square of the first canonical correlation (see Chapter 11) between the y’s and the grouping variables in MANOVA. In the multivariate regression setting, $\theta$ is the square of the first canonical correlation, $r_1^2$, between the y’s and the x’s.

Measures of association based on the Lawley-Hotelling and Pillai statistics are given by

\[
A_{LH} = \frac{U^{(s)}/s}{1 + U^{(s)}/s}, \qquad A_P = \frac{V^{(s)}}{s}. \tag{10.68}
\]

By (6.48) and (6.49), $A_P$ in (10.68) is the average of the $s$ squared canonical correlations, $r_1^2, r_2^2, \ldots, r_s^2$.
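The eigenvalue-based measures above can be collected in one small function. This is a sketch under the assumption that all $s$ eigenvalues of $\mathbf{E}^{-1}\mathbf{H}$ are supplied; the name `association_measures` and the dictionary keys are hypothetical. The input used here is the set of eigenvalues quoted in Example 10.5.1.

```python
import numpy as np

def association_measures(eigvals):
    """R^2-like measures of association computed from the eigenvalues of E^{-1}H."""
    lam = np.asarray(eigvals, dtype=float)
    s = lam.size                          # assumes all s eigenvalues are passed in
    wilks = np.prod(1.0 / (1.0 + lam))    # Wilks' Lambda
    U = lam.sum()                         # Lawley-Hotelling U^(s)
    V = np.sum(lam / (1.0 + lam))         # Pillai V^(s)
    return {
        "eta2_wilks": float(1.0 - wilks),
        "A_wilks": float(1.0 - wilks ** (1.0 / s)),
        "eta2_theta": float(lam.max() / (1.0 + lam.max())),
        "A_LH": float((U / s) / (1.0 + U / s)),
        "A_P": float(V / s),              # average squared canonical correlation
    }

measures = association_measures([26.3183, 0.1004, 0.0033])
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```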


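Example 10.6. We use the chemical data of Table 10.1 to illustrate some measures of association. For the three dependent variables $y_1$, $y_2$, and $y_3$ and the three independent variables $x_1$, $x_2$, and $x_3$, the partitioned covariance matrix is

\[
\mathbf{S} = \begin{pmatrix} \mathbf{S}_{yy} & \mathbf{S}_{yx} \\ \mathbf{S}_{xy} & \mathbf{S}_{xx} \end{pmatrix} =
\left(\begin{array}{rrr|rrr}
99.30 & -28.57 & -59.03 & -41.95 & -9.49 & -7.37 \\
-28.57 & 22.25 & 5.78 & 11.85 & 1.60 & 3.03 \\
-59.03 & 5.78 & 45.27 & 24.14 & 6.43 & 3.97 \\ \hline
-41.95 & 11.85 & 24.14 & 38.67 & -12.17 & -.22 \\
-9.49 & 1.60 & 6.43 & -12.17 & 17.95 & 1.22 \\
-7.36 & 3.03 & 3.97 & -.22 & 1.22 & 2.67
\end{array}\right),
\]

from which we obtain $R_M^2$ and $RV$ directly using (10.66) and (10.67),

\[
R_M^2 = \frac{|\mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}|}{|\mathbf{S}_{yy}|} = .00029,
\]
\[
RV = \frac{\operatorname{tr}(\mathbf{S}_{xy}\mathbf{S}_{yx})}{\sqrt{\operatorname{tr}(\mathbf{S}_{xx}^2)\operatorname{tr}(\mathbf{S}_{yy}^2)}} = .403.
\]

Using the results in Example 10.5.1, we obtain

\[
\begin{aligned}
\eta_\Lambda^2 &= 1 - \Lambda = 1 - .0332 = .967, \\
A_\Lambda &= 1 - \Lambda^{1/s} = 1 - (.0332)^{1/3} = .679, \\
\eta_\theta^2 &= \frac{\lambda_1}{1 + \lambda_1} = .963, \\
A_{LH} &= \frac{U^{(s)}/s}{1 + U^{(s)}/s} = \frac{26.422/3}{1 + 26.422/3} = .898, \\
A_P &= \frac{V^{(s)}}{s} = \frac{1.058}{3} = .352.
\end{aligned}
\]

We have general agreement among $\eta_\Lambda^2$, $A_\Lambda$, $\eta_\theta^2$, and $A_{LH}$. But $R_M^2$, $RV$, and $A_P$ do not appear to be measuring the same level of association, especially $R_M^2$. It appears that more study is needed before one or more of these measures can be universally recommended. □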
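The two covariance-based measures in Example 10.6 can be recomputed from the blocks of the partitioned covariance matrix. The sketch below assumes the upper-right block as printed for $\mathbf{S}_{yx}$ and takes $\mathbf{S}_{xy} = \mathbf{S}_{yx}'$; the function names `r2_m` and `rv_coefficient` are hypothetical.

```python
import numpy as np

# Blocks of the partitioned covariance matrix from Example 10.6
S_yy = np.array([[99.30, -28.57, -59.03],
                 [-28.57, 22.25, 5.78],
                 [-59.03, 5.78, 45.27]])
S_yx = np.array([[-41.95, -9.49, -7.37],
                 [11.85, 1.60, 3.03],
                 [24.14, 6.43, 3.97]])
S_xx = np.array([[38.67, -12.17, -0.22],
                 [-12.17, 17.95, 1.22],
                 [-0.22, 1.22, 2.67]])

def r2_m(S_yy, S_yx, S_xx):
    """R_M^2 = |S_yx S_xx^{-1} S_xy| / |S_yy|, with S_xy taken as S_yx'."""
    numerator = np.linalg.det(S_yx @ np.linalg.solve(S_xx, S_yx.T))
    return numerator / np.linalg.det(S_yy)

def rv_coefficient(S_yy, S_yx, S_xx):
    """RV = tr(S_xy S_yx) / sqrt(tr(S_xx^2) tr(S_yy^2))."""
    numerator = np.trace(S_yx.T @ S_yx)
    return numerator / np.sqrt(np.trace(S_xx @ S_xx) * np.trace(S_yy @ S_yy))

r2 = r2_m(S_yy, S_yx, S_xx)
rv = rv_coefficient(S_yy, S_yx, S_xx)
print(f"R2_M = {r2:.5f}, RV = {rv:.3f}")
```

Because the printed covariances are rounded to two decimals, the recomputed values can differ slightly in the last digit from those obtained with the raw data.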


10.7  SUBSET  SELECTION 

As  in  the  univariate  y  case  in  Section  10.2.7,  there  may  be  more  potential  predictor 
variables  (x’s)  than  are  useful  in  a  given  situation.  Some  of  the  x’s  may  be  redundant 
in  the  presence  of  the  other  x’s. 

We may also be interested in deleting some of the y’s if they are not well predicted by any of the x’s. This would lead to further simplification of the model.

We  present  two  approaches  to  subset  selection:  stepwise  procedures  and  methods 
involving  all  possible  subsets. 

10.7.1  Stepwise  Procedures 

Subset  selection  among  the  x’s  is  discussed  in  Section  10.7.1a,  followed  by  selection 
among  the  y’s  in  Section  10.7.1b. 

10.7.1a Finding a Subset of the x’s

We begin with the forward selection procedure based on Wilks’ $\Lambda$. At the first step, we test the regression of the $p$ y’s on each $x_j$. There will be two rows in the $\mathbf{B}$ matrix, a row containing the intercepts and a row corresponding to $x_j$:

\[
\mathbf{B}_j = \begin{pmatrix} \beta_{01} & \beta_{02} & \cdots & \beta_{0p} \\ \beta_{j1} & \beta_{j2} & \cdots & \beta_{jp} \end{pmatrix}.
\]

We use the overall regression test statistic (10.55),

\[
\Lambda(x_j) = \frac{|\mathbf{Y}'\mathbf{Y} - \hat{\mathbf{B}}_j'\mathbf{X}_j'\mathbf{Y}|}{|\mathbf{Y}'\mathbf{Y} - n\bar{\mathbf{y}}\bar{\mathbf{y}}'|},
\]

which is distributed as $\Lambda_{p,1,n-2}$, since $\mathbf{B}_j$ has two rows and $\mathbf{X}_j$ has two columns. After calculating $\Lambda(x_j)$ for each $j$, we choose the variable with minimum $\Lambda(x_j)$. Note that at the first step, we are not testing each variable in the presence of the others. We are searching for the variable $x_j$ that best predicts the $p$ y’s by itself, not above and beyond the other x’s.

At the second step, we seek the variable yielding the smallest partial $\Lambda$ for each $x$ adjusted for the variable first entered, where the partial $\Lambda$-statistic is given by (10.65). After one variable has entered, (10.65) becomes

\[
\Lambda(x_j \mid x_1) = \frac{\Lambda(x_1, x_j)}{\Lambda(x_1)}, \tag{10.69}
\]

where $x_1$ denotes the variable entered at the first step. We calculate (10.69) for each $x_j \ne x_1$ and choose the variable that minimizes $\Lambda(x_j \mid x_1)$.

If we denote the second variable to enter by $x_2$, then at the third step we seek the $x_j$ that minimizes

\[
\Lambda(x_j \mid x_1, x_2) = \frac{\Lambda(x_1, x_2, x_j)}{\Lambda(x_1, x_2)}. \tag{10.70}
\]

By property 7 in Section 6.1.3, the partial Wilks’ $\Lambda$-statistic transforms to an exact $F$ since $\nu_H = h = 1$ at each step.

After $m$ variables have been selected, the partial $\Lambda$ would have the following form at the next step:

\[
\Lambda(x_j \mid x_1, x_2, \ldots, x_m) = \frac{\Lambda(x_1, x_2, \ldots, x_m, x_j)}{\Lambda(x_1, x_2, \ldots, x_m)}, \tag{10.71}
\]

where the first $m$ variables to enter are denoted $x_1, x_2, \ldots, x_m$, and $x_j$ is a candidate variable from among the $q - m$ remaining variables. At this step, we would choose the $x_j$ that minimizes (10.71). The partial Wilks’ $\Lambda$ in (10.71) is distributed as $\Lambda_{p,1,n-m-1}$ and, by Table 6.1, transforms to a partial $F$-statistic distributed as $F_{p,n-m-p}$. [These distributions hold for a variable $x_j$ chosen before seeing the data but not for the $x_j$ that minimizes (10.71) or maximizes the corresponding partial $F$.] The procedure continues, bringing in the “best” variable at each step, until a step is reached at which the minimum partial $\Lambda$ exceeds a predetermined threshold value or, equivalently, the associated partial $F$ falls below a preselected value. Alternatively, the stopping rule can be cast in terms of the $p$-value of the partial $\Lambda$ or $F$. If the smallest $p$-value at some step exceeds a predetermined value, the procedure stops.

For  each  xj,  there  corresponds  an  entire  row  of  the  B  matrix  because  xj  has  a 
coefficient  for  each  of  the  p  y’s.  Thus  if  a  certain  x  significantly  predicts  even  one  of 
the  y’s,  it  should  be  retained. 
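The forward selection steps in (10.69) through (10.71) can be sketched as follows. This is an illustrative outline on simulated data, assuming a stopping rule stated as a threshold on the minimum partial $\Lambda$; the names `overall_wilks`, `forward_select`, and `threshold` are hypothetical, and the threshold value 0.5 is arbitrary rather than a critical value.

```python
import numpy as np

def overall_wilks(Y, X):
    """Overall-regression Wilks' Lambda for Y on the columns of X (intercept added)."""
    n = Y.shape[0]
    X1 = np.column_stack([np.ones(n), X])
    B_hat, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    E = Y.T @ Y - B_hat.T @ X1.T @ Y
    ybar = Y.mean(axis=0)
    total = Y.T @ Y - n * np.outer(ybar, ybar)
    return np.linalg.det(E) / np.linalg.det(total)

def forward_select(Y, X, threshold):
    """Enter x's one at a time; stop when the smallest partial Lambda exceeds threshold."""
    selected, remaining = [], list(range(X.shape[1]))
    lam_current = 1.0  # Lambda of the intercept-only model
    while remaining:
        # Partial Lambda of each candidate given the variables already entered
        partial = {j: overall_wilks(Y, X[:, selected + [j]]) / lam_current
                   for j in remaining}
        best = min(partial, key=partial.get)
        if partial[best] > threshold:
            break
        selected.append(best)
        remaining.remove(best)
        lam_current *= partial[best]  # now equals Lambda for the enlarged subset
    return selected

# Simulated data (assumption): y1 depends on x0 and y2 on x3; the other x's are noise
rng = np.random.default_rng(1)
n = 60
X = rng.normal(size=(n, 5))
Y = 3.0 * X[:, [0, 3]] + rng.normal(size=(n, 2))

selected = forward_select(Y, X, threshold=0.5)
print(sorted(selected))
```

In practice the threshold would come from a critical value or $p$-value of the partial $\Lambda$ or its $F$ transformation, subject to the caveat in the text that these distributions do not hold exactly for the selected variable.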

The  stepwise  procedure  is  an  extension  of  forward  selection.  Each  time  a  variable 
enters,  all  the  variables  that  have  entered  previously  are  checked  by  a  partial  A  or  F 
to  see  if  the  least  “significant”  one  is  now  redundant  and  can  be  deleted. 

The backward elimination procedure begins with all x’s (all rows of $\mathbf{B}$) included in the model and deletes one at a time using a partial $\Lambda$ or $F$. At the first step, the partial $\Lambda$ for each $x_j$ is

\[
\Lambda(x_j \mid x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_q) = \frac{\Lambda(x_1, \ldots, x_q)}{\Lambda(x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_q)}, \tag{10.72}
\]

which is distributed as $\Lambda_{p,1,n-q-1}$ and can be converted to $F_{p,n-q-p}$ by Table 6.1. The variable with largest $\Lambda$ or smallest $F$ is deleted. At the second step, a partial $\Lambda$ or $F$ is calculated for each of the $q - 1$ remaining variables, and again the least important variable in the presence of the others is eliminated. This process continues until a step is reached at which the largest $\Lambda$ or smallest $F$ is significant, indicating that the corresponding variable is apparently not redundant in the presence of its fellows. Some preselected $p$-value or threshold value of $\Lambda$ or $F$ is used to determine a stopping rule.

If  no  automated  program  is  available  for  subset  selection  in  the  multivariate  case, 
a  forward  selection  or  backward  elimination  procedure  could  be  carried  out  by  means 
of  a  rather  simple  set  of  commands  based  on  (10.71)  or  (10.72). 
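As one possible "simple set of commands" based on (10.72), backward elimination might be outlined as below. The sketch uses simulated data and a hypothetical stopping threshold on the largest partial $\Lambda$; the names `overall_wilks` and `backward_eliminate` are illustrative, and the loop is written so that at least one x always remains in the model.

```python
import numpy as np

def overall_wilks(Y, X):
    """Overall-regression Wilks' Lambda for Y on the columns of X (intercept added)."""
    n = Y.shape[0]
    X1 = np.column_stack([np.ones(n), X])
    B_hat, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    E = Y.T @ Y - B_hat.T @ X1.T @ Y
    ybar = Y.mean(axis=0)
    total = Y.T @ Y - n * np.outer(ybar, ybar)
    return np.linalg.det(E) / np.linalg.det(total)

def backward_eliminate(Y, X, threshold):
    """Delete x's one at a time; stop when the largest partial Lambda is below threshold."""
    kept = list(range(X.shape[1]))
    while len(kept) > 1:
        lam_full = overall_wilks(Y, X[:, kept])
        # Partial Lambda (10.72) of each x_j adjusted for the other x's still kept
        partial = {j: lam_full / overall_wilks(Y, X[:, [k for k in kept if k != j]])
                   for j in kept}
        worst = max(partial, key=partial.get)
        if partial[worst] < threshold:
            break  # even the least important variable is judged non-redundant
        kept.remove(worst)
    return kept

# Simulated data (assumption): only x0 and x3 predict the y's
rng = np.random.default_rng(1)
n = 60
X = rng.normal(size=(n, 5))
Y = 3.0 * X[:, [0, 3]] + rng.normal(size=(n, 2))

kept = backward_eliminate(Y, X, threshold=0.5)
print(sorted(kept))
```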

A sequential procedure such as stepwise selection will often fail to find the optimum subset, especially if a large pool of predictor variables is involved. However, the value of Wilks’ $\Lambda$ found by stepwise selection may not be far from that for the optimum subset.

The remarks in the final paragraph of Section 10.2.7b are pertinent in the multivariate context as well. True predictors of the y’s in the population may be overlooked because of inflated error variances, or, on the other hand, x’s that are not true predictors may appear to be so in the sample. The latter problem can be severe for small sample sizes (Rencher and Pun 1980).


10.7.1b Finding a Subset of the y’s

After a subset of x’s has been found, the researcher may wish to know if these x’s predict all $p$ of the y’s. If some of the y’s do not relate to any of the x’s, they could be deleted from the model to achieve a further simplification. The y’s can be checked for redundancy in a manner analogous to the stepwise discriminant approach presented in Sections 6.11.2 and 8.9, which finds subsets of dependent variables using a full and reduced model Wilks’ $\Lambda$ for the y’s. The partial $\Lambda$-statistic for adding or deleting a $y$ is similar to (10.69), (10.70), or (10.71), except that dependent variables are involved rather than independent variables. Thus to add a $y$ at the third step of a forward selection procedure, for example, where the first two variables already entered are denoted as $y_1$ and $y_2$, we use (6.128) to obtain

\[
\Lambda(y_j \mid y_1, y_2) = \frac{\Lambda(y_1, y_2, y_j)}{\Lambda(y_1, y_2)} \tag{10.73}
\]

for each $y_j \ne y_1$ or $y_2$, and we choose the $y_j$ that minimizes $\Lambda(y_j \mid y_1, y_2)$. [In (6.128) the dependent variable of interest was denoted by $x$ instead of $y_j$ as here.] Similarly, if three y’s, designated as $y_1$, $y_2$, and $y_3$, were “in the model” and we were checking the feasibility of adding $y_j$, the partial $\Lambda$-statistic would be

\[
\Lambda(y_j \mid y_1, y_2, y_3) = \frac{\Lambda(y_1, y_2, y_3, y_j)}{\Lambda(y_1, y_2, y_3)}, \tag{10.74}
\]

which is distributed as $\Lambda_{1,q,n-q-4}$, where $q$ is the number of x’s presently in the model and 4 is the number of y’s presently in the model. The two Wilks’ $\Lambda$ values in the numerator and denominator of the right side of (10.74), $\Lambda(y_1, y_2, y_3, y_j)$ and $\Lambda(y_1, y_2, y_3)$, are obtained from (10.55). Since $p = 1$, $\Lambda_{1,q,n-q-4}$ in (10.74) transforms to $F_{q,n-q-4}$ (see Table 6.1).

In the first step of a backward elimination procedure, we would delete the $y_j$ that maximizes

\[
\Lambda(y_j \mid y_1, \ldots, y_{j-1}, y_{j+1}, \ldots, y_p) = \frac{\Lambda(y_1, \ldots, y_p)}{\Lambda(y_1, \ldots, y_{j-1}, y_{j+1}, \ldots, y_p)}, \tag{10.75}
\]

which is distributed as $\Lambda_{1,\nu_H,\nu_E-p+1}$. In this case, $\nu_H = q$ and $\nu_E = n - q - 1$ so that the distribution of (10.75) is $\Lambda_{1,q,n-q-p}$, which can be transformed to an exact $F$. Note that $q$, the number of x’s, may have been reduced in a subset selection on the x’s, as in Section 10.7.1a. Similarly, $p$ is the number of y’s and will decrease in subsequent steps.

Stopping rules for either the forward or backward approach could be defined in terms of $p$-values or threshold values of $\Lambda$ or the equivalent $F$. A stepwise procedure could be devised as a modification of the forward approach.

If a software program is available that tests the significance of one $x$ as in (10.72), it can be adapted to test one $y$ as in (10.75) by use of property 3 of Section 6.1.3: The distribution of $\Lambda_{p,\nu_H,\nu_E}$ is the same as that of $\Lambda_{\nu_H,p,\nu_E+\nu_H-p}$, which can also be seen from the symmetry of $\Lambda$ in (10.57),

\[
\Lambda = \frac{|\mathbf{S}|}{|\mathbf{S}_{xx}|\,|\mathbf{S}_{yy}|},
\]

which is distributed as $\Lambda_{p,q,n-q-1}$ or, equivalently, as $\Lambda_{q,p,n-p-1}$. Thus we can reverse the y’s and x’s; we list the x’s as dependent variables in the program and the y’s as independent variables. Then the test of a single $y$ in (10.74) or (10.75) can be carried out using (10.72) without any adjustment. The partial $\Lambda$-statistic in (10.72) is distributed as $\Lambda_{p,1,n-q-1}$. If we interchange $p$ and $q$, because the y’s and x’s are interchanged as dependent and independent variables, this becomes $\Lambda_{q,1,n-p-1}$. By property 3 of Section 6.1.3 (repeated above), this is equivalent to $\Lambda_{1,q,n-p-1+1-q} = \Lambda_{1,q,n-p-q}$, which is the distribution of (10.75).
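The symmetry argument can be checked numerically. The sketch below, on simulated data, compares the regression form of $\Lambda$ with the covariance form in (10.57), and confirms that the partial $\Lambda$ for one $y$ in (10.75) is reproduced when the roles of the y’s and x’s are reversed. All function names are hypothetical helpers.

```python
import numpy as np

def overall_wilks(Y, X):
    """Overall-regression Wilks' Lambda for Y on the columns of X (intercept added)."""
    n = Y.shape[0]
    X1 = np.column_stack([np.ones(n), X])
    B_hat, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    E = Y.T @ Y - B_hat.T @ X1.T @ Y
    ybar = Y.mean(axis=0)
    total = Y.T @ Y - n * np.outer(ybar, ybar)
    return np.linalg.det(E) / np.linalg.det(total)

def wilks_from_covariance(Y, X):
    """Lambda = |S| / (|S_xx| |S_yy|), the symmetric form in (10.57)."""
    p = Y.shape[1]
    S = np.cov(np.column_stack([Y, X]), rowvar=False)
    S_yy, S_xx = S[:p, :p], S[p:, p:]
    return np.linalg.det(S) / (np.linalg.det(S_xx) * np.linalg.det(S_yy))

def partial_wilks_y(Y, X, j):
    """Partial Lambda (10.75) for y_j adjusted for the other y's."""
    return overall_wilks(Y, X) / overall_wilks(np.delete(Y, j, axis=1), X)

def partial_wilks_y_swapped(Y, X, j):
    """Same quantity via (10.72) with the y's listed as independent variables."""
    return overall_wilks(X, Y) / overall_wilks(X, np.delete(Y, j, axis=1))

# Simulated data (assumption)
rng = np.random.default_rng(2)
n = 30
X = rng.normal(size=(n, 3))
Y = X @ rng.normal(size=(3, 2)) + rng.normal(size=(n, 2))

lam_regression = overall_wilks(Y, X)
lam_covariance = wilks_from_covariance(Y, X)
lam_reversed = overall_wilks(X, Y)
print(lam_regression, lam_covariance, lam_reversed)
print(partial_wilks_y(Y, X, 0), partial_wilks_y_swapped(Y, X, 0))
```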

10.7.2  All  Possible  Subsets 

In Section 10.2.7a, we discussed the criteria $R_p^2$, $s_p^2$, and $C_p$ for comparing all possible subsets of x’s to predict a univariate $y$ in multiple regression, where $p - 1$ denotes the number of x’s in a subset selected from a pool of $k - 1$ available independent variables. We now discuss some matrix analogues of these criteria for the multivariate y case, as suggested by Mallows (1973) and Sparks, Coutsourides, and Troskie (1983).

In this section, in order to conform with notation in the literature, we will use $p$ for the number of columns in $\mathbf{X}$ (and the number of rows in $\mathbf{B}$), rather than for the number of y’s. The number of y’s will be indicated by $m$.

We now extend the three criteria $R_p^2$, $s_p^2$, and $C_p$ to analogous matrix expressions $\mathbf{R}_p^2$, $\mathbf{S}_p$, and $\mathbf{C}_p$. These can be reduced to scalar form using trace or determinant.

1. $\mathbf{R}_p^2$. In the univariate y case, $R_p^2$ for a $(p - 1)$-variable subset of x’s is defined by (10.32) as

\[
R_p^2 = \frac{\hat{\boldsymbol{\beta}}_p'\mathbf{X}_p'\mathbf{y} - n\bar{y}^2}{\mathbf{y}'\mathbf{y} - n\bar{y}^2}.
\]

A direct extension of $R_p^2$ for the multivariate y case is given by the matrix

\[
\mathbf{R}_p^2 = (\mathbf{Y}'\mathbf{Y} - n\bar{\mathbf{y}}\bar{\mathbf{y}}')^{-1}(\hat{\mathbf{B}}_p'\mathbf{X}_p'\mathbf{Y} - n\bar{\mathbf{y}}\bar{\mathbf{y}}'), \tag{10.76}
\]

where $p - 1$ is the number of x’s selected from the $k - 1$ available x’s. To convert $\mathbf{R}_p^2$ to scalar form, we can use $\operatorname{tr}(\mathbf{R}_p^2)/m$, in which we divide by $m$, the number of y’s, so that $0 \le \operatorname{tr}(\mathbf{R}_p^2)/m \le 1$. As in the univariate case, we identify the subset that maximizes $\operatorname{tr}(\mathbf{R}_p^2)/m$ for each value of $p = 2, 3, \ldots, k$. The criterion $\operatorname{tr}(\mathbf{R}_p^2)/m$ does not attain its maximum until $p$ reaches $k$, but we look for the value of $p$ at which further increases are deemed unimportant. We could also use $|\mathbf{R}_p^2|$ in place of $\operatorname{tr}(\mathbf{R}_p^2)/m$.

2. $\mathbf{S}_p$. A direct extension of the univariate criterion $s_p^2 = \mathrm{MSE}_p = \mathrm{SSE}_p/(n - p)$ is provided by

\[
\mathbf{S}_p = \frac{\mathbf{E}_p}{n - p}, \tag{10.77}
\]

where $\mathbf{E}_p = \mathbf{Y}'\mathbf{Y} - \hat{\mathbf{B}}_p'\mathbf{X}_p'\mathbf{Y}$. To convert to a scalar, we can use $\operatorname{tr}(\mathbf{S}_p)$ or $|\mathbf{S}_p|$, either of which will behave in an analogous fashion to $s_p^2$ in the univariate case. The remarks in Section 10.2.7a apply here as well; one may wish to select the subset with minimum value of $\operatorname{tr}(\mathbf{S}_p)$ or perhaps the subset with smallest $p$ such that $\operatorname{tr}(\mathbf{S}_p) < \operatorname{tr}(\mathbf{S}_k)$. A similar application could be made with $|\mathbf{S}_p|$.

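3. $\mathbf{C}_p$. To extend the $C_p$ criterion to the multivariate y case, we write the model under consideration as

\[
\mathbf{Y} = \mathbf{X}_p\mathbf{B}_p + \boldsymbol{\Xi},
\]

where $p - 1$ is the number of x’s in the subset and $k - 1$ is the number of x’s in the “full model.” The predicted values of the y’s are given by

\[
\hat{\mathbf{Y}} = \mathbf{X}_p\hat{\mathbf{B}}_p.
\]

We are interested in predicted values of the observation vectors, namely, $\hat{\mathbf{y}}_1, \hat{\mathbf{y}}_2, \ldots, \hat{\mathbf{y}}_n$, which are given by the rows of $\hat{\mathbf{Y}}$:

\[
\begin{pmatrix} \hat{\mathbf{y}}_1' \\ \hat{\mathbf{y}}_2' \\ \vdots \\ \hat{\mathbf{y}}_n' \end{pmatrix}
= \begin{pmatrix} \mathbf{x}_{p1}' \\ \mathbf{x}_{p2}' \\ \vdots \\ \mathbf{x}_{pn}' \end{pmatrix} \hat{\mathbf{B}}_p
= \begin{pmatrix} \mathbf{x}_{p1}'\hat{\mathbf{B}}_p \\ \mathbf{x}_{p2}'\hat{\mathbf{B}}_p \\ \vdots \\ \mathbf{x}_{pn}'\hat{\mathbf{B}}_p \end{pmatrix}.
\]

In general, the predicted vectors $\hat{\mathbf{y}}_i$ are biased estimators of $E(\mathbf{y}_i)$ in the correct model, because we are examining many competing models for which $E(\hat{\mathbf{y}}_i) \ne E(\mathbf{y}_i)$. In place of the univariate expected squared error $E[\hat{y}_i - E(y_i)]^2$ in (10.37) and (10.38), we define a matrix of expected squares and products of errors, $E[\hat{\mathbf{y}}_i - E(\mathbf{y}_i)][\hat{\mathbf{y}}_i - E(\mathbf{y}_i)]'$. We then add and subtract $E(\hat{\mathbf{y}}_i)$ to obtain (see Problem 10.8)

\[
\begin{aligned}
E[\hat{\mathbf{y}}_i &- E(\mathbf{y}_i)][\hat{\mathbf{y}}_i - E(\mathbf{y}_i)]' \\
&= E[\hat{\mathbf{y}}_i - E(\hat{\mathbf{y}}_i) + E(\hat{\mathbf{y}}_i) - E(\mathbf{y}_i)][\hat{\mathbf{y}}_i - E(\hat{\mathbf{y}}_i) + E(\hat{\mathbf{y}}_i) - E(\mathbf{y}_i)]' \\
&= E[\hat{\mathbf{y}}_i - E(\hat{\mathbf{y}}_i)][\hat{\mathbf{y}}_i - E(\hat{\mathbf{y}}_i)]' + [E(\hat{\mathbf{y}}_i) - E(\mathbf{y}_i)][E(\hat{\mathbf{y}}_i) - E(\mathbf{y}_i)]' \\
&= \operatorname{cov}(\hat{\mathbf{y}}_i) + (\text{bias in } \hat{\mathbf{y}}_i)(\text{bias in } \hat{\mathbf{y}}_i)'.
\end{aligned} \tag{10.78}
\]

By analogy to (10.39), our $\mathbf{C}_p$ matrix will be an estimate of the sum of (10.78), multiplied by $\boldsymbol{\Sigma}^{-1}$ for standardization.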

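We first find an expression for $\operatorname{cov}(\hat{\mathbf{y}}_i)$, which for convenience we write in row form,

\[
\operatorname{cov}(\hat{\mathbf{y}}_i') = \operatorname{cov}(\mathbf{x}_{pi}'\hat{\mathbf{B}}_p) = \operatorname{cov}(\mathbf{x}_{pi}'\hat{\boldsymbol{\beta}}_{p(1)}, \mathbf{x}_{pi}'\hat{\boldsymbol{\beta}}_{p(2)}, \ldots, \mathbf{x}_{pi}'\hat{\boldsymbol{\beta}}_{p(m)}),
\]

where $\hat{\mathbf{B}}_p = (\hat{\boldsymbol{\beta}}_{p(1)}, \hat{\boldsymbol{\beta}}_{p(2)}, \ldots, \hat{\boldsymbol{\beta}}_{p(m)})$, as in (10.47). This can be written as

\[
\operatorname{cov}(\hat{\mathbf{y}}_i') =
\begin{pmatrix}
\sigma_{11}\mathbf{x}_{pi}'(\mathbf{X}_p'\mathbf{X}_p)^{-1}\mathbf{x}_{pi} & \cdots & \sigma_{1m}\mathbf{x}_{pi}'(\mathbf{X}_p'\mathbf{X}_p)^{-1}\mathbf{x}_{pi} \\
\vdots & & \vdots \\
\sigma_{m1}\mathbf{x}_{pi}'(\mathbf{X}_p'\mathbf{X}_p)^{-1}\mathbf{x}_{pi} & \cdots & \sigma_{mm}\mathbf{x}_{pi}'(\mathbf{X}_p'\mathbf{X}_p)^{-1}\mathbf{x}_{pi}
\end{pmatrix}
= \mathbf{x}_{pi}'(\mathbf{X}_p'\mathbf{X}_p)^{-1}\mathbf{x}_{pi}\,\boldsymbol{\Sigma}, \tag{10.79}
\]

where $m$ is the number of y’s and $\boldsymbol{\Sigma} = \operatorname{cov}(\mathbf{y}_i)$ (see Problem 10.9). As in (10.41) (see also Problem 10.5), we can sum over the $n$ observations and use (3.65) to obtain

\[
\sum_{i=1}^{n} \operatorname{cov}(\hat{\mathbf{y}}_i') = \sum_{i=1}^{n} \mathbf{x}_{pi}'(\mathbf{X}_p'\mathbf{X}_p)^{-1}\mathbf{x}_{pi}\,\boldsymbol{\Sigma}
= \boldsymbol{\Sigma}\sum_{i=1}^{n} \mathbf{x}_{pi}'(\mathbf{X}_p'\mathbf{X}_p)^{-1}\mathbf{x}_{pi} = p\boldsymbol{\Sigma}. \tag{10.80}
\]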

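To sum the second term on the right of (10.78), we have, by analogy to (10.42),

\[
\sum_{i=1}^{n} (\text{bias in } \hat{\mathbf{y}}_i)(\text{bias in } \hat{\mathbf{y}}_i)' = (n - p)E(\mathbf{S}_p - \boldsymbol{\Sigma}), \tag{10.81}
\]

where $\mathbf{S}_p$ is given by (10.77).

Now by (10.80) and (10.81), we can sum (10.78) and multiply by $\boldsymbol{\Sigma}^{-1}$ to obtain the matrix of total expected squares and products of error standardized by $\boldsymbol{\Sigma}^{-1}$,

\[
\begin{aligned}
\boldsymbol{\Sigma}^{-1}\sum_{i=1}^{n} E[\hat{\mathbf{y}}_i - E(\mathbf{y}_i)][\hat{\mathbf{y}}_i - E(\mathbf{y}_i)]' &= \boldsymbol{\Sigma}^{-1}[p\boldsymbol{\Sigma} + (n - p)E(\mathbf{S}_p - \boldsymbol{\Sigma})] \\
&= p\mathbf{I} + (n - p)\boldsymbol{\Sigma}^{-1}E(\mathbf{S}_p - \boldsymbol{\Sigma}).
\end{aligned} \tag{10.82}
\]

Using $\mathbf{S}_k = \mathbf{E}_k/(n - k)$, the sample covariance matrix based on all $k - 1$ variables, as an estimate of $\boldsymbol{\Sigma}$, we can estimate (10.82) by

\[
\mathbf{C}_p = p\mathbf{I} + (n - p)\mathbf{S}_k^{-1}(\mathbf{S}_p - \mathbf{S}_k) \tag{10.83}
\]
\[
\phantom{\mathbf{C}_p} = \mathbf{S}_k^{-1}\mathbf{E}_p + (2p - n)\mathbf{I} \quad [\text{by } (10.77)], \tag{10.84}
\]

which is the form suggested by Mallows (1973). We can use $\operatorname{tr}(\mathbf{C}_p)$ or $|\mathbf{C}_p|$ to reduce this to a scalar. But if $2p - n$ is negative, $|\mathbf{C}_p|$ can be negative, and Sparks, Coutsourides, and Troskie (1983) suggested a modification of $|\mathbf{C}_p|$,

\[
|\mathbf{E}_k^{-1}\mathbf{E}_p|, \tag{10.85}
\]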


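which is always positive.

To find the optimal subset of x’s for each value of $p$, we could examine all possible subsets [or use a computational scheme such as that of Furnival and Wilson (1974)] and look for the “smallest” $\mathbf{C}_p$ matrix for each $p$. In (10.82), we see that when the bias is $\mathbf{O}$, the “population $\mathbf{C}_p$” is equal to $p\mathbf{I}$. Thus we seek a $\mathbf{C}_p$ that is “small” and near $p\mathbf{I}$. In terms of trace, we seek $\operatorname{tr}(\mathbf{C}_p)$ close to $pm$, where $m$ is the number of y’s in the vector of measurements; that is, $\operatorname{tr}(\mathbf{I}) = m$.

To find a “target” value for (10.85), we write $\mathbf{E}_k^{-1}\mathbf{E}_p$ in terms of $\mathbf{C}_p$ from (10.84),

\[
\mathbf{E}_k^{-1}\mathbf{E}_p = \frac{\mathbf{C}_p + (n - 2p)\mathbf{I}}{n - k}. \tag{10.86}
\]

When the bias is $\mathbf{O}$, we have $\mathbf{C}_p = p\mathbf{I}$, and (10.86) becomes

\[
\mathbf{E}_k^{-1}\mathbf{E}_p = \frac{n - p}{n - k}\,\mathbf{I}, \tag{10.87}
\]

whence, by (2.85),

\[
|\mathbf{E}_k^{-1}\mathbf{E}_p| = \left(\frac{n - p}{n - k}\right)^m. \tag{10.88}
\]

Thus we seek subsets such that

\[
\operatorname{tr}(\mathbf{C}_p) \le pm \qquad \text{or} \qquad |\mathbf{E}_k^{-1}\mathbf{E}_p| \le \left(\frac{n - p}{n - k}\right)^m.
\]

In summary, when examining all possible subsets, any or all of the following criteria may be useful in finding the single best subset or the best subset for each $p$:

\[
\operatorname{tr}(\mathbf{R}_p^2)/m, \quad |\mathbf{R}_p^2|, \quad \operatorname{tr}(\mathbf{S}_p), \quad |\mathbf{S}_p|, \quad \operatorname{tr}(\mathbf{C}_p), \quad |\mathbf{E}_k^{-1}\mathbf{E}_p|.
\]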
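The six scalar criteria can be computed for any candidate subset with a short routine. The following is a sketch on simulated data; `subset_criteria` and `error_matrix` are hypothetical names, $p$ and $k$ count the intercept column as in this section, and the subset is assumed to be nonempty.

```python
import numpy as np

def error_matrix(Y, X1):
    """E = Y'Y - B_hat'X'Y, where X1 already contains the intercept column."""
    B_hat, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    return Y.T @ Y - B_hat.T @ X1.T @ Y

def subset_criteria(Y, X, subset):
    """Scalar all-subsets criteria for the x columns listed in `subset`."""
    n, m = Y.shape
    ones = np.ones((n, 1))
    X_full = np.hstack([ones, X])
    X_p = np.hstack([ones, X[:, list(subset)]])
    k, p = X_full.shape[1], X_p.shape[1]
    E_k, E_p = error_matrix(Y, X_full), error_matrix(Y, X_p)
    ybar = Y.mean(axis=0)
    total = Y.T @ Y - n * np.outer(ybar, ybar)
    R2_p = np.linalg.solve(total, total - E_p)                  # (10.76)
    S_p, S_k = E_p / (n - p), E_k / (n - k)                     # (10.77)
    C_p = np.linalg.solve(S_k, E_p) + (2 * p - n) * np.eye(m)   # (10.84)
    return {
        "tr_R2_over_m": float(np.trace(R2_p) / m),
        "det_R2": float(np.linalg.det(R2_p)),
        "tr_S": float(np.trace(S_p)),
        "det_S": float(np.linalg.det(S_p)),
        "tr_C": float(np.trace(C_p)),
        "det_EkinvEp": float(np.linalg.det(np.linalg.solve(E_k, E_p))),  # (10.85)
        "target_tr_C": float(p * m),
        "target_det": float(((n - p) / (n - k)) ** m),          # (10.88)
    }

# Simulated data (assumption): the y's depend on x0 and x1 only
rng = np.random.default_rng(3)
n, q, m = 25, 4, 2
X = rng.normal(size=(n, q))
Y = X[:, :2] @ rng.normal(size=(2, m)) + rng.normal(size=(n, m))

full = subset_criteria(Y, X, range(q))
sub = subset_criteria(Y, X, [0, 1])
print(sub)
```

For the full model ($p = k$) the routine returns $\operatorname{tr}(\mathbf{C}_k) = km$ and $|\mathbf{E}_k^{-1}\mathbf{E}_k| = 1$, which matches the target values above and serves as a quick sanity check.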

10.8  MULTIVARIATE  REGRESSION:  RANDOM  x’s 

In Sections 10.4 and 10.5, it was assumed that the x’s were fixed and would have the same values in repeated sampling. In many applications, the x’s are random variables. In such a case, the values of $x_1, x_2, \ldots, x_q$ are not under the control of the experimenter but occur randomly along with $y_1, y_2, \ldots, y_p$. On each subject, we observe $p + q$ values in the vector $(y_1, y_2, \ldots, y_p, x_1, x_2, \ldots, x_q)$.

If we assume that $(y_1, y_2, \ldots, y_p, x_1, x_2, \ldots, x_q)$ has a multivariate normal distribution, then all estimates and tests have the same formulation as in the fixed-x case [for details, see Rencher (1998, Section 7.7)]. Thus there is no essential difference in our procedures between the fixed-x case and the random-x case.

PROBLEMS

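10.1 Show that $\sum_{i=1}^{n} (y_i - \mathbf{x}_i'\hat{\boldsymbol{\beta}})^2 = (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})$ as in (10.6).

10.2 Show that $\sum_{i=1}^{n} (y_i - \mu)^2 = \sum_{i=1}^{n} (y_i - \bar{y})^2 + n(\bar{y} - \mu)^2$ as in (10.10).

10.3 Show that $\sum_{i=1}^{n} (x_{i2} - \bar{x}_2)\bar{y} = 0$ as in (10.19).

10.4 Show that $E[\hat{y}_i - E(y_i)]^2 = E[\hat{y}_i - E(\hat{y}_i)]^2 + [E(\hat{y}_i) - E(y_i)]^2$ as in (10.37).

10.5 Show that $\sum_{i=1}^{n} \operatorname{var}(\hat{y}_i)/\sigma^2 = \operatorname{tr}[\mathbf{X}_p(\mathbf{X}_p'\mathbf{X}_p)^{-1}\mathbf{X}_p']$ as in (10.40).

10.6 Show that the alternative form of $C_p$ in (10.45) is equal to the original form in (10.44).

10.7 Show that (10.48) is the same as (10.49), that is, $(\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}})'(\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}}) = \mathbf{Y}'\mathbf{Y} - \hat{\mathbf{B}}'\mathbf{X}'\mathbf{Y}$.

10.8 Show that

\[
E[\hat{\mathbf{y}}_i - E(\mathbf{y}_i)][\hat{\mathbf{y}}_i - E(\mathbf{y}_i)]' = E[\hat{\mathbf{y}}_i - E(\hat{\mathbf{y}}_i)][\hat{\mathbf{y}}_i - E(\hat{\mathbf{y}}_i)]' + [E(\hat{\mathbf{y}}_i) - E(\mathbf{y}_i)][E(\hat{\mathbf{y}}_i) - E(\mathbf{y}_i)]',
\]

thus verifying (10.78).

10.9 Show that $\operatorname{cov}(\hat{\mathbf{y}}_i')$ has the form given in (10.79).

10.10 Show that the two forms of $\mathbf{C}_p$ in (10.83) and (10.84) are equal.

10.11 Explain why $|\mathbf{E}_k^{-1}\mathbf{E}_p| > 0$, as claimed following (10.85).

10.12 Show that $\mathbf{E}_k^{-1}\mathbf{E}_p = [\mathbf{C}_p + (n - 2p)\mathbf{I}]/(n - k)$, as in (10.86), where $\mathbf{C}_p$ is given in (10.83).

10.13 Show that if $\mathbf{C}_p = p\mathbf{I}$, then $\mathbf{E}_k^{-1}\mathbf{E}_p = [(n - p)/(n - k)]\mathbf{I}$ as in (10.87).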

10.14  Use  the  diabetes  data  of  Table  3.4. 

(a)  Find  the  least  squares  estimate  B  for  the  regression  of  (yi,  >’2)  on  (xi ,  xj, 
X3). 

(b)  Test  the  significance  of  overall  regression  using  all  four  test  statistics. 

(c)  Determine  what  the  eigenvalues  of  E_  1 H  reveal  about  the  essential  rank 
of  B 1  and  the  implications  of  this  rank,  such  as  the  relative  power  of  the 
four  tests. 

(d)  Test  the  significance  of  each  of  x  1 ,  xo,  and  X3  adjusted  for  the  other  two 
x’s. 

10.15  Use  the  sons  data  of  Table  3.7. 

(a) Find the least squares estimate $\hat{\mathbf{B}}$ for the regression of $(y_1, y_2)$ on $(x_1, x_2)$.

(b) Test the significance of overall regression using all four test statistics.

(c) Determine what the eigenvalues of $\mathbf{E}^{-1}\mathbf{H}$ reveal about the essential rank
of $\hat{\mathbf{B}}_1$ and the implications of this rank, such as the relative power of the
four tests.

(d) Test the significance of $x_1$ adjusted for $x_2$ and of $x_2$ adjusted for $x_1$.

10.16  Use  the  glucose  data  of  Table  3.8. 

(a) Find the least squares estimate $\hat{\mathbf{B}}$ for the regression of $(y_1, y_2, y_3)$ on $(x_1, x_2, x_3)$.

(b) Test the significance of overall regression using all four test statistics.

(c) Determine what the eigenvalues of $\mathbf{E}^{-1}\mathbf{H}$ reveal about the essential rank
of $\hat{\mathbf{B}}_1$ and the implications of this rank, such as the relative power of the
four tests.

(d) Test the significance of each of $x_1$, $x_2$, and $x_3$ adjusted for the other two x's.

(e)  Test  the  significance  of  each  y  adjusted  for  the  other  two  by  using  (10.75). 

10.17  Use  the  Seishu  data  of  Table  7.1. 

(a) Find the least squares estimate $\hat{\mathbf{B}}$ for the regression of $(y_1, y_2)$ on
$(x_1, x_2, \ldots, x_8)$ and test for significance.

(b) Test the significance of $(x_7, x_8)$ adjusted for the other x's.

(c) Test the significance of $(x_4, x_5, x_6)$ adjusted for the other x's.

(d) Test the significance of $(x_1, x_2, x_3)$ adjusted for the other x's.

10.18  Use  the  Seishu  data  of  Table  7.1. 

(a) Do a stepwise regression to select a subset of $x_1, x_2, \ldots, x_8$ that adequately
predicts $(y_1, y_2)$.

(b)  After  selecting  a  subset  of  x’s,  use  the  methods  of  Section  10.7.1b  to 
check  if  either  of  the  y’s  can  be  deleted. 

10.19  Use  the  temperature  data  of  Table  7.2. 

(a) Find the least squares estimate $\hat{\mathbf{B}}$ for the regression of $(y_4, y_5, y_6)$ on
$(y_1, y_2, y_3)$ and test for significance.

(b) Find the least squares estimate $\hat{\mathbf{B}}$ for the regression of $(y_7, y_8, y_9)$ on
$(y_1, \ldots, y_6)$ and test for significance.

(c) Find the least squares estimate $\hat{\mathbf{B}}$ for the regression of $(y_{10}, y_{11})$ on
$(y_1, \ldots, y_9)$ and test for significance.

10.20 Using the temperature data of Table 7.2, carry out a stepwise regression to
select a subset of $y_1, y_2, \ldots, y_9$ that adequately predicts $(y_{10}, y_{11})$.


CHAPTER  11 


Canonical  Correlation 


11.1  INTRODUCTION 

Canonical  correlation  analysis  is  concerned  with  the  amount  of  (linear)  relationship 
between  two  sets  of  variables.  We  often  measure  two  types  of  variables  on  each 
research  unit — for  example,  a  set  of  aptitude  variables  and  a  set  of  achievement  vari¬ 
ables,  a  set  of  personality  variables  and  a  set  of  ability  measures,  a  set  of  price  indices 
and  a  set  of  production  indices,  a  set  of  student  behaviors  and  a  set  of  teacher  behav¬ 
iors,  a  set  of  psychological  attributes  and  a  set  of  physiological  attributes,  a  set  of 
ecological  variables  and  a  set  of  environmental  variables,  a  set  of  academic  achieve¬ 
ment  variables  and  a  set  of  measures  of  job  success,  a  set  of  closed-book  exam  scores 
and  a  set  of  open-book  exam  scores,  and  a  set  of  personality  variables  of  freshmen 
students  and  the  same  variables  on  the  same  students  as  seniors. 


11.2  CANONICAL  CORRELATIONS  AND  CANONICAL  VARIATES 


We assume that two sets of variables $\mathbf{y}' = (y_1, y_2, \ldots, y_p)$ and $\mathbf{x}' = (x_1, x_2, \ldots, x_q)$
are  measured  on  the  same  sampling  unit.  We  denote  the  two  sets  of  variables  as  y 
and  x  to  conform  to  notation  in  Chapters  3,  7,  and  10.  In  Section  7.4.1,  we  discussed 
the  hypothesis  that  y  and  x  were  independent.  In  this  chapter,  we  consider  a  measure 
of  overall  correlation  between  y  and  x. 

Canonical  correlation  is  an  extension  of  multiple  correlation,  which  is  the  cor¬ 
relation  between  one  y  and  several  x’s  (see  Section  10.2.6).  Canonical  correlation 
analysis  is  often  a  useful  complement  to  a  multivariate  regression  analysis. 

We first review multiple correlation. The sample covariances and correlations
among $y, x_1, x_2, \ldots, x_q$ can be summarized in the matrices

$$\mathbf{S} = \begin{pmatrix} s_y^2 & \mathbf{s}_{yx}' \\ \mathbf{s}_{yx} & \mathbf{S}_{xx} \end{pmatrix}, \tag{11.1}$$

$$\mathbf{R} = \begin{pmatrix} 1 & \mathbf{r}_{yx}' \\ \mathbf{r}_{yx} & \mathbf{R}_{xx} \end{pmatrix}, \tag{11.2}$$

where $\mathbf{s}_{yx}' = (s_{y1}, s_{y2}, \ldots, s_{yq})$ contains the sample covariances of $y$ with
$x_1, x_2, \ldots, x_q$ and $\mathbf{S}_{xx}$ is the sample covariance matrix of the x's [see (10.16)]. The
partitioned matrix $\mathbf{R}$ is defined analogously; $\mathbf{r}_{yx}' = (r_{y1}, r_{y2}, \ldots, r_{yq})$ contains the
sample correlations of $y$ with $x_1, x_2, \ldots, x_q$, and $\mathbf{R}_{xx}$ is the sample correlation
matrix of the x's [see (10.35)].

By  (10.34),  the  squared  multiple  correlation  between  y  and  the  x’s  can  be  com¬ 
puted  from  the  partitioned  covariance  matrix  (11.1)  or  correlation  matrix  (11.2)  as 
follows: 


$$R^2 = \frac{\mathbf{s}_{yx}'\mathbf{S}_{xx}^{-1}\mathbf{s}_{yx}}{s_y^2} = \mathbf{r}_{yx}'\mathbf{R}_{xx}^{-1}\mathbf{r}_{yx}. \tag{11.3}$$

In $R^2$, the $q$ covariances between $y$ and the x's in $\mathbf{s}_{yx}$ or the $q$ correlations between $y$
and the x's in $\mathbf{r}_{yx}$ are channeled into a single measure of linear relationship between $y$
and the x's. The multiple correlation $R$ can be defined alternatively as the maximum
correlation between $y$ and a linear combination of the x's; that is, $R = \max_{\mathbf{b}} r_{y,\mathbf{b}'\mathbf{x}}$.

We  now  return  to  the  case  of  several  y’s  and  several  x’s.  The  covariance  struc¬ 
ture  associated  with  two  subvectors  y  and  x  was  first  discussed  in  Section  3.8.1. 
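By (3.42), the overall sample covariance matrix for $(y_1, \ldots, y_p, x_1, \ldots, x_q)$ can be
partitioned as

$$\mathbf{S} = \begin{pmatrix} \mathbf{S}_{yy} & \mathbf{S}_{yx} \\ \mathbf{S}_{xy} & \mathbf{S}_{xx} \end{pmatrix},$$

where $\mathbf{S}_{yy}$ is the $p \times p$ sample covariance matrix of the y's, $\mathbf{S}_{yx}$ is the $p \times q$ matrix
of sample covariances between the y's and the x's, and $\mathbf{S}_{xx}$ is the $q \times q$ sample
covariance matrix of the x's.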

In Section 10.6, we discussed several measures of association between the y's
and the x's. The first of these is defined in (10.66) as $R_M^2 = |\mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}|/|\mathbf{S}_{yy}|$,
which is analogous to $R^2 = \mathbf{s}_{yx}'\mathbf{S}_{xx}^{-1}\mathbf{s}_{yx}/s_y^2$ in (11.3). By (2.89) and (2.91), $R_M^2$ can
be rewritten as $R_M^2 = |\mathbf{S}_{yy}^{-1}\mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}|$. By (2.108), $R_M^2$ can be expressed as

$$R_M^2 = |\mathbf{S}_{yy}^{-1}\mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}| = \prod_{i=1}^{s} r_i^2,$$

where $s = \min(p, q)$ and $r_1^2, r_2^2, \ldots, r_s^2$ are the eigenvalues of $\mathbf{S}_{yy}^{-1}\mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}$.
When written in this form, $R_M^2$ is seen to be a poor measure of association because
$0 \le r_i^2 \le 1$ for all $i$, and the product will usually be too small to meaningfully reflect
the amount of association. (In Example 10.6, $R_M^2 = .00029$ was a tiny fraction of
the other measures of association.) The eigenvalues themselves, on the other hand,
provide meaningful measures of association between the y's and the x's. The square
roots of the eigenvalues, $r_1, r_2, \ldots, r_s$, are called canonical correlations.

The best overall measure of association is the largest squared canonical correlation
(maximum eigenvalue) $r_1^2$ of $\mathbf{S}_{yy}^{-1}\mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}$, but the other eigenvalues (squared
canonical correlations) of $\mathbf{S}_{yy}^{-1}\mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}$ provide measures of supplemental dimensions
of (linear) relationship between $\mathbf{y}$ and $\mathbf{x}$. As an alternative approach, it can be
shown that $r_1^2$ is the maximum squared correlation between a linear combination of
the y's, $u = \mathbf{a}'\mathbf{y}$, and a linear combination of the x's, $v = \mathbf{b}'\mathbf{x}$; that is,

$$r_1 = \max_{\mathbf{a},\mathbf{b}} r_{\mathbf{a}'\mathbf{y},\mathbf{b}'\mathbf{x}}. \tag{11.4}$$

We denote the coefficient vectors that yield the maximum correlation as $\mathbf{a}_1$ and $\mathbf{b}_1$.
Thus $r_1$ (the positive square root of $r_1^2$) is the correlation between $u_1 = \mathbf{a}_1'\mathbf{y}$ and
$v_1 = \mathbf{b}_1'\mathbf{x}$. The coefficient vectors $\mathbf{a}_1$ and $\mathbf{b}_1$ can be found as eigenvectors [see (11.7)
and (11.8)]. The linear functions $u_1$ and $v_1$ are called the first canonical variates.
There are additional canonical variates $u_i = \mathbf{a}_i'\mathbf{y}$ and $v_i = \mathbf{b}_i'\mathbf{x}$ corresponding to $r_2$,
$r_3, \ldots, r_s$.
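The eigenvalue definition and the maximization property in (11.4) can be checked numerically. The following is a minimal sketch, assuming NumPy and a small synthetic data set; the data, the seed, and all variable names are illustrative and do not come from the text. It also uses the identity that $\mathbf{b}_1$ is proportional to $\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}\mathbf{a}_1$, which is an assumption brought in for the illustration rather than a step stated above.

```python
import numpy as np

rng = np.random.default_rng(0)           # synthetic data, for illustration only
n, p, q = 200, 2, 3
x = rng.normal(size=(n, q))
y = x @ rng.normal(size=(q, p)) + rng.normal(size=(n, p))

# Partitioned sample covariance matrix of (y_1..y_p, x_1..x_q).
S = np.cov(np.hstack([y, x]), rowvar=False)
Syy, Syx = S[:p, :p], S[:p, p:]
Sxy, Sxx = S[p:, :p], S[p:, p:]

# Eigenvalues of Syy^{-1} Syx Sxx^{-1} Sxy are the squared canonical correlations.
M = np.linalg.solve(Syy, Syx) @ np.linalg.solve(Sxx, Sxy)
evals, evecs = np.linalg.eig(M)           # M is nonsymmetric, so use eig
order = np.argsort(evals.real)[::-1]
r2 = evals.real[order]
a1 = evecs.real[:, order[0]]

# Assumed identity: b_1 is proportional to Sxx^{-1} Sxy a_1.
b1 = np.linalg.solve(Sxx, Sxy) @ a1

# First canonical variates and their sample correlation.
u1, v1 = y @ a1, x @ b1
r_u1v1 = np.corrcoef(u1, v1)[0, 1]
print(np.sqrt(r2), r_u1v1)
```

In this sketch the printed correlation between $u_1$ and $v_1$ should agree with the square root of the largest eigenvalue.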

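It was noted in Section 2.11.5 that the (nonzero) eigenvalues of $\mathbf{AB}$ are the same
as those of $\mathbf{BA}$ as long as $\mathbf{AB}$ and $\mathbf{BA}$ are square but that the eigenvectors of $\mathbf{AB}$ and
$\mathbf{BA}$ are not the same. If we let $\mathbf{A} = \mathbf{S}_{yy}^{-1}\mathbf{S}_{yx}$ and $\mathbf{B} = \mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}$, then $r_1^2, r_2^2, \ldots, r_s^2$ can
also be obtained from $\mathbf{BA} = \mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}\mathbf{S}_{yy}^{-1}\mathbf{S}_{yx}$ as well as from $\mathbf{AB} = \mathbf{S}_{yy}^{-1}\mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}$.
Thus the eigenvalues can be obtained from either of the characteristic equations

$$|\mathbf{S}_{yy}^{-1}\mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy} - r^2\mathbf{I}| = 0, \tag{11.5}$$

$$|\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}\mathbf{S}_{yy}^{-1}\mathbf{S}_{yx} - r^2\mathbf{I}| = 0. \tag{11.6}$$

The coefficient vectors $\mathbf{a}_i$ and $\mathbf{b}_i$ in the canonical variates $u_i = \mathbf{a}_i'\mathbf{y}$ and $v_i = \mathbf{b}_i'\mathbf{x}$ are
the eigenvectors of these same two matrices:

$$(\mathbf{S}_{yy}^{-1}\mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy} - r^2\mathbf{I})\mathbf{a} = \mathbf{0}, \tag{11.7}$$

$$(\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}\mathbf{S}_{yy}^{-1}\mathbf{S}_{yx} - r^2\mathbf{I})\mathbf{b} = \mathbf{0}. \tag{11.8}$$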

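Thus the two matrices $\mathbf{S}_{yy}^{-1}\mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}$ and $\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}\mathbf{S}_{yy}^{-1}\mathbf{S}_{yx}$ have the same (nonzero)
eigenvalues, as indicated in (11.5) and (11.6), but different eigenvectors, as in (11.7)
and (11.8). Since $\mathbf{y}$ is $p \times 1$ and $\mathbf{x}$ is $q \times 1$, the $\mathbf{a}_i$'s are $p \times 1$ and the $\mathbf{b}_i$'s are
$q \times 1$. This can also be seen in the sizes of the matrices in (11.7) and (11.8); that is,
$\mathbf{S}_{yy}^{-1}\mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}$ is $p \times p$ and $\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}\mathbf{S}_{yy}^{-1}\mathbf{S}_{yx}$ is $q \times q$. Since $p$ is typically not equal
to $q$, the matrix that is larger in size will be singular, and the smaller one will be
nonsingular. We illustrate for $p < q$. In this case $\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}\mathbf{S}_{yy}^{-1}\mathbf{S}_{yx}$ has the form

[Diagram of the matrix product with column dimensions q, p, p, q.]

When $p < q$, the rank of $\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}\mathbf{S}_{yy}^{-1}\mathbf{S}_{yx}$ is $p$, because $\mathbf{S}_{xx}^{-1}$ has rank $q$ and
$\mathbf{S}_{xy}\mathbf{S}_{yy}^{-1}\mathbf{S}_{yx}$ has rank $p$. In this case $p$ eigenvalues are nonzero and the remaining
$q - p$ eigenvalues are equal to zero. In general, there are $s = \min(p, q)$ values of
the squared canonical correlation $r_i^2$ with $s$ corresponding pairs of canonical variates
$u_i = \mathbf{a}_i'\mathbf{y}$ and $v_i = \mathbf{b}_i'\mathbf{x}$. For example, if $p = 3$ and $q = 7$, there will be three
canonical correlations, $r_1$, $r_2$, and $r_3$.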
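The statement that the $p \times p$ and $q \times q$ matrices share their nonzero eigenvalues, with the larger matrix picking up $q - p$ zero eigenvalues, can be illustrated as follows. This is a sketch on synthetic data, assuming NumPy; the dimensions, seed, names, and the numerical tolerance used for the rank are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)           # synthetic data, for illustration only
n, p, q = 100, 2, 4                      # p < q
x = rng.normal(size=(n, q))
y = x[:, :p] + rng.normal(size=(n, p))

S = np.cov(np.hstack([y, x]), rowvar=False)
Syy, Syx = S[:p, :p], S[:p, p:]
Sxy, Sxx = S[p:, :p], S[p:, p:]

A = np.linalg.solve(Syy, Syx)            # Syy^{-1} Syx, p x q
B = np.linalg.solve(Sxx, Sxy)            # Sxx^{-1} Sxy, q x p

# Eigenvalues of the p x p matrix AB and the q x q matrix BA, largest first.
ev_small = np.sort(np.linalg.eigvals(A @ B).real)[::-1]
ev_large = np.sort(np.linalg.eigvals(B @ A).real)[::-1]

# Numerical rank of the larger matrix (tolerance chosen for this sketch).
rank_large = np.linalg.matrix_rank(B @ A, tol=1e-10)
print(ev_small, ev_large, rank_large)
```

The first $p$ entries of `ev_large` should match `ev_small`, and the remaining $q - p$ entries should be zero up to rounding error.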

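Thus we have $s$ canonical correlations $r_1, r_2, \ldots, r_s$ corresponding to the $s$ pairs
of canonical variates $u_i$ and $v_i$:

$$\begin{array}{lll}
r_1 & u_1 = \mathbf{a}_1'\mathbf{y} & v_1 = \mathbf{b}_1'\mathbf{x} \\
r_2 & u_2 = \mathbf{a}_2'\mathbf{y} & v_2 = \mathbf{b}_2'\mathbf{x} \\
\vdots & \quad\vdots & \quad\vdots \\
r_s & u_s = \mathbf{a}_s'\mathbf{y} & v_s = \mathbf{b}_s'\mathbf{x}
\end{array}$$

For each $i$, $r_i$ is the (sample) correlation between $u_i$ and $v_i$; that is, $r_i = r_{u_i v_i}$. The
pairs $(u_i, v_i)$, $i = 1, 2, \ldots, s$, provide the $s$ dimensions of relationship. For simplicity,
we would prefer only one dimension of relationship, but this occurs only when
$s = 1$, that is, when $p = 1$ or $q = 1$.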

The $s$ dimensions of relationship $(u_i, v_i)$, $i = 1, 2, \ldots, s$, are nonredundant.
The information each pair provides is unavailable in the other pairs because $u_1$,
$u_2, \ldots, u_s$ are uncorrelated. They are not orthogonal because $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_s$ are
eigenvectors of $\mathbf{S}_{yy}^{-1}\mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}$, which is nonsymmetric. Similarly, the $v_i$'s are
uncorrelated, and each $u_i$ is uncorrelated with all $v_j$, $j \ne i$, except, of course, $v_i$.

We examine the elements of the coefficient vectors $\mathbf{a}_i$ and $\mathbf{b}_i$ for the information
they provide about the contribution of the y's and x's to $r_i$. These coefficients can
be standardized, as noted in the last paragraph in the present section and in
Section 11.5.1.

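As noted, the matrix $\mathbf{S}_{yy}^{-1}\mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}$ is not symmetric. Many algorithms for
computation of eigenvalues and eigenvectors accept only symmetric matrices. Since
$\mathbf{S}_{yy}^{-1}\mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}$ is the product of the two symmetric matrices $\mathbf{S}_{yy}^{-1}$ and $\mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}$,
we can proceed as in (6.23) and work with $(\mathbf{U}')^{-1}\mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}\mathbf{U}^{-1}$, where $\mathbf{U}'\mathbf{U} = \mathbf{S}_{yy}$
is the Cholesky factorization of $\mathbf{S}_{yy}$ (see Section 2.7). The symmetric matrix
$(\mathbf{U}')^{-1}\mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}\mathbf{U}^{-1}$ has the same eigenvalues as $\mathbf{S}_{yy}^{-1}\mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}$ but has
eigenvectors $\mathbf{U}\mathbf{a}_i$, where $\mathbf{a}_i$ is given in (11.7).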
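The Cholesky route can be sketched as follows, assuming NumPy and synthetic data (seed, dimensions, and names are illustrative only). Note that `numpy.linalg.cholesky` returns a lower-triangular factor $\mathbf{L}$ with $\mathbf{L}\mathbf{L}' = \mathbf{S}_{yy}$, so the sketch takes $\mathbf{U} = \mathbf{L}'$ to match the convention $\mathbf{U}'\mathbf{U} = \mathbf{S}_{yy}$. The eigenvectors $\mathbf{e}_i$ of the symmetric matrix are mapped back as $\mathbf{a}_i = \mathbf{U}^{-1}\mathbf{e}_i$.

```python
import numpy as np

rng = np.random.default_rng(2)           # synthetic data, for illustration only
n, p, q = 150, 3, 4
x = rng.normal(size=(n, q))
y = x @ rng.normal(size=(q, p)) + rng.normal(size=(n, p))

S = np.cov(np.hstack([y, x]), rowvar=False)
Syy, Syx = S[:p, :p], S[:p, p:]
Sxy, Sxx = S[p:, :p], S[p:, p:]

K = Syx @ np.linalg.solve(Sxx, Sxy)      # Syx Sxx^{-1} Sxy, symmetric p x p

U = np.linalg.cholesky(Syy).T            # upper triangular, U'U = Syy
Uinv = np.linalg.inv(U)
sym = Uinv.T @ K @ Uinv                  # (U')^{-1} Syx Sxx^{-1} Sxy U^{-1}
sym = (sym + sym.T) / 2                  # remove rounding asymmetry

w, E = np.linalg.eigh(sym)               # symmetric solver, ascending order
w, E = w[::-1], E[:, ::-1]               # largest eigenvalue first
A = Uinv @ E                             # columns a_i = U^{-1} e_i

M = np.linalg.solve(Syy, K)              # nonsymmetric Syy^{-1} Syx Sxx^{-1} Sxy
print(np.sqrt(w))                        # canonical correlations
```

With this scaling the columns of `A` satisfy $\mathbf{A}'\mathbf{S}_{yy}\mathbf{A} = \mathbf{I}$, so the resulting $u_i$ have unit variance and are mutually uncorrelated.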

In effect, the $pq$ covariances between the y's and x's in $\mathbf{S}_{yx}$ have been replaced
by $s = \min(p, q)$ canonical correlations. These succinctly summarize the relationships
between $\mathbf{y}$ and $\mathbf{x}$. In fact, in a typical study, we do not need all $s$ canonical
correlations. The smallest eigenvalues can be disregarded to achieve even more
simplification. As in (8.13) for discriminant functions, we can judge the importance of
each eigenvalue by its relative size:

$$\frac{r_i^2}{\sum_{j=1}^{s} r_j^2}. \tag{11.9}$$

The canonical correlations can also be obtained from the partitioned correlation
matrix of the y's and x's,

$$\mathbf{R} = \begin{pmatrix} \mathbf{R}_{yy} & \mathbf{R}_{yx} \\ \mathbf{R}_{xy} & \mathbf{R}_{xx} \end{pmatrix},$$

where $\mathbf{R}_{yy}$ is the $p \times p$ sample correlation matrix of the y's, $\mathbf{R}_{yx}$ is the $p \times q$
matrix of sample correlations between the y's and the x's, and $\mathbf{R}_{xx}$ is the $q \times q$
sample correlation matrix of the x's. The matrix $\mathbf{R}_{yy}^{-1}\mathbf{R}_{yx}\mathbf{R}_{xx}^{-1}\mathbf{R}_{xy}$ is analogous to
$R^2 = \mathbf{r}_{yx}'\mathbf{R}_{xx}^{-1}\mathbf{r}_{yx}$ in the univariate $y$ case. The characteristic equations corresponding
to (11.5) and (11.6),

$$|\mathbf{R}_{yy}^{-1}\mathbf{R}_{yx}\mathbf{R}_{xx}^{-1}\mathbf{R}_{xy} - r^2\mathbf{I}| = 0, \tag{11.10}$$

$$|\mathbf{R}_{xx}^{-1}\mathbf{R}_{xy}\mathbf{R}_{yy}^{-1}\mathbf{R}_{yx} - r^2\mathbf{I}| = 0, \tag{11.11}$$

yield the same eigenvalues $r_1^2, r_2^2, \ldots, r_s^2$ as (11.5) and (11.6) (the canonical
correlations are scale invariant; see property 1 in Section 11.3).

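If we use the partitioned correlation matrix in place of the covariance matrix in
(11.7) and (11.8), we obtain the same eigenvalues (squared canonical correlations)
but different eigenvectors:

$$(\mathbf{R}_{yy}^{-1}\mathbf{R}_{yx}\mathbf{R}_{xx}^{-1}\mathbf{R}_{xy} - r^2\mathbf{I})\mathbf{c} = \mathbf{0}, \tag{11.12}$$

$$(\mathbf{R}_{xx}^{-1}\mathbf{R}_{xy}\mathbf{R}_{yy}^{-1}\mathbf{R}_{yx} - r^2\mathbf{I})\mathbf{d} = \mathbf{0}. \tag{11.13}$$

The relationship between the eigenvectors $\mathbf{c}$ and $\mathbf{d}$ in (11.12) and (11.13) and the
eigenvectors $\mathbf{a}$ and $\mathbf{b}$ in (11.7) and (11.8) is

$$\mathbf{c} = \mathbf{D}_y\mathbf{a} \quad \text{and} \quad \mathbf{d} = \mathbf{D}_x\mathbf{b}, \tag{11.14}$$

where $\mathbf{D}_y = \mathrm{diag}(s_{y_1}, s_{y_2}, \ldots, s_{y_p})$ and $\mathbf{D}_x = \mathrm{diag}(s_{x_1}, s_{x_2}, \ldots, s_{x_q})$.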
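The relationship $\mathbf{c} = \mathbf{D}_y\mathbf{a}$ in (11.14), together with the equality of the eigenvalues from the covariance and correlation forms, can be checked on synthetic data. The sketch below assumes NumPy; the deliberately unequal variable scales, the seed, and the helper names `product` and `leading_eigvec` are hypothetical choices for the illustration. Because eigenvectors are determined only up to a scalar multiple, both vectors are normalized to unit length and aligned in sign before being compared.

```python
import numpy as np

rng = np.random.default_rng(3)           # synthetic data, for illustration only
n, p, q = 150, 2, 3
x = rng.normal(size=(n, q)) * np.array([1.0, 10.0, 0.1])     # unequal scales
y = (x @ rng.normal(size=(q, p)) + rng.normal(size=(n, p))) * np.array([5.0, 0.5])

data = np.hstack([y, x])
S = np.cov(data, rowvar=False)           # partitioned covariance matrix
R = np.corrcoef(data, rowvar=False)      # partitioned correlation matrix

def product(C):
    """Return Cyy^{-1} Cyx Cxx^{-1} Cxy for a partitioned matrix C."""
    Cyy, Cyx, Cxy, Cxx = C[:p, :p], C[:p, p:], C[p:, :p], C[p:, p:]
    return np.linalg.solve(Cyy, Cyx) @ np.linalg.solve(Cxx, Cxy)

def leading_eigvec(M):
    """Largest eigenvalue of M and its unit-length eigenvector."""
    vals, vecs = np.linalg.eig(M)
    k = np.argmax(vals.real)
    v = vecs.real[:, k]
    return vals.real[k], v / np.linalg.norm(v)

r2_S, a = leading_eigvec(product(S))     # from (11.7)
r2_R, c = leading_eigvec(product(R))     # from (11.12)

Dy = np.diag(np.sqrt(np.diag(S[:p, :p])))
c_from_a = Dy @ a                        # (11.14), up to scale
c_from_a /= np.linalg.norm(c_from_a)
c_from_a *= np.sign(c_from_a @ c)        # eigenvectors are defined up to sign
print(r2_S, r2_R, c, c_from_a)
```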

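The eigenvectors $\mathbf{c}$ and $\mathbf{d}$ in (11.12), (11.13), and (11.14) are standardized
coefficient vectors. By analogy to (8.15), they would be applied to standardized variables.
To show this, note that in terms of centered variables $\mathbf{y} - \bar{\mathbf{y}}$, we have

$$\begin{aligned}
u &= \mathbf{a}'(\mathbf{y} - \bar{\mathbf{y}}) = \mathbf{a}'\mathbf{D}_y\mathbf{D}_y^{-1}(\mathbf{y} - \bar{\mathbf{y}}) \\
  &= \mathbf{c}'\mathbf{D}_y^{-1}(\mathbf{y} - \bar{\mathbf{y}}) \qquad [\text{by (11.14)}] \\
  &= c_1\frac{y_1 - \bar{y}_1}{s_{y_1}} + c_2\frac{y_2 - \bar{y}_2}{s_{y_2}} + \cdots + c_p\frac{y_p - \bar{y}_p}{s_{y_p}}.
\end{aligned} \tag{11.15}$$

Hence $\mathbf{c}$ and $\mathbf{d}$ are preferred to $\mathbf{a}$ and $\mathbf{b}$ for interpretation of the canonical variates $u_i$
and $v_i$.


11.3 PROPERTIES OF CANONICAL CORRELATIONS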

Two  interesting  properties  of  canonical  correlations  are  the  following  [for  other  prop¬ 
erties,  see  Rencher  (1998,  Section  8.3)]: 

1.  Canonical  correlations  are  invariant  to  changes  of  scale  on  either  the  y’s  or  the 
x’s.  For  example,  if  the  measurement  scale  is  changed  from  inches  to  centime¬ 
ters,  the  canonical  correlations  will  not  change  (the  corresponding  eigenvectors 
will  change).  This  property  holds  for  simple  and  multiple  correlations  as  well. 

2. The first canonical correlation $r_1$ is the maximum correlation between linear
functions of $\mathbf{y}$ and $\mathbf{x}$. Therefore, $r_1$ exceeds (the absolute value of) the simple
correlation between any $y$ and any $x$ or the multiple correlation between any $y$
and all the x's or between any $x$ and all the y's.
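Both properties can be illustrated numerically. The following sketch assumes NumPy and synthetic data; the helper names `first_canonical_corr` and `multiple_corr`, the seed, and the rescaling factor 2.54 applied to one y (mimicking inches to centimeters) are illustrative assumptions, not part of the text.

```python
import numpy as np

def first_canonical_corr(Y, X):
    """Largest canonical correlation between the columns of Y and X."""
    p = Y.shape[1]
    S = np.cov(np.hstack([Y, X]), rowvar=False)
    Syy, Syx, Sxy, Sxx = S[:p, :p], S[:p, p:], S[p:, :p], S[p:, p:]
    M = np.linalg.solve(Syy, Syx) @ np.linalg.solve(Sxx, Sxy)
    return np.sqrt(np.max(np.linalg.eigvals(M).real))

def multiple_corr(z, W):
    """Multiple correlation of one variable z with the columns of W, as in (11.3)."""
    S = np.cov(np.column_stack([z, W]), rowvar=False)
    s_zw, S_ww = S[0, 1:], S[1:, 1:]
    return np.sqrt(s_zw @ np.linalg.solve(S_ww, s_zw) / S[0, 0])

rng = np.random.default_rng(4)           # synthetic data, for illustration only
n, p, q = 120, 3, 4
X = rng.normal(size=(n, q))
Y = X @ rng.normal(size=(q, p)) + 2.0 * rng.normal(size=(n, p))

r1 = first_canonical_corr(Y, X)

# Property 1: rescaling a variable leaves the canonical correlation unchanged.
r1_rescaled = first_canonical_corr(Y * np.array([2.54, 1.0, 1.0]), X)

# Property 2: r1 is at least as large as every simple or multiple correlation.
simple_max = np.abs(np.corrcoef(np.hstack([Y, X]), rowvar=False)[:p, p:]).max()
mult_y_max = max(multiple_corr(Y[:, j], X) for j in range(p))
mult_x_max = max(multiple_corr(X[:, j], Y) for j in range(q))
print(r1, r1_rescaled, simple_max, mult_y_max, mult_x_max)
```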

Example 11.3. For the chemical data of Table 10.1, we obtain the canonical
correlations and illustrate property 2. We consider the extended set of nine x's, as in
Example 10.5.2. The matrix $\mathbf{R}_{yx}$ of correlations between the y's and the x's is

         x1     x2     x3    x1x2   x1x3   x2x3   x1^2   x2^2   x3^2
  y1   -.68   -.22   -.45   -.41   -.55   -.45   -.68   -.23   -.42
  y2    .40    .08    .39    .16    .44    .33    .40    .12    .33
  y3    .58    .23    .36    .40    .45    .39    .58    .22    .36

The three canonical correlations and their squares are

  $r_1 = .9899$    $r_1^2 = .9800$
  $r_2 = .9528$    $r_2^2 = .9078$
  $r_3 = .4625$    $r_3^2 = .2139$.

From the relative sizes of the squared canonical correlations, we would consider
only the first two to be important. A hypothesis test for the significance of each is
carried out in Example 11.4.2.

To confirm that property 2 holds in this case, we compare $r_1 = .9899$ to the
individual correlations and the multiple correlations. We first note that .9899 is greater
than individual correlations, since (the absolute value of) the largest correlation in
$\mathbf{R}_{yx}$ is .68. The multiple correlation $R_{y_j|\mathbf{x}}$ of each $y_j$ with the x's is given by

  $R_{y_1|\mathbf{x}} = .987$,  $R_{y_2|\mathbf{x}} = .921$,  $R_{y_3|\mathbf{x}} = .906$,

and for the multiple correlation of each $x$ with the y's we have

  $R_{x_1|\mathbf{y}} = .691$,     $R_{x_2|\mathbf{y}} = .237$,     $R_{x_3|\mathbf{y}} = .507$,
  $R_{x_1x_2|\mathbf{y}} = .432$,  $R_{x_1x_3|\mathbf{y}} = .585$,  $R_{x_2x_3|\mathbf{y}} = .482$,
  $R_{x_1^2|\mathbf{y}} = .690$,   $R_{x_2^2|\mathbf{y}} = .234$,   $R_{x_3^2|\mathbf{y}} = .466$.

The first canonical correlation $r_1 = .9899$ exceeds all multiple correlations, and
property 2 is satisfied. □


11.4  TESTS  OF  SIGNIFICANCE 

In  the  following  two  sections  we  discuss  basic  tests  of  significance  associated  with 
canonical  correlations.  For  other  aspects  of  model  validation  for  canonical  correla¬ 
tions  and  variates,  see  Rencher  (1998,  Section  8.5). 

11.4.1 Tests of No Relationship between the y's and the x's

In Section 7.4.1, we considered the hypothesis of independence, $H_0$: $\boldsymbol{\Sigma}_{yx} = \mathbf{O}$. If
$\boldsymbol{\Sigma}_{yx} = \mathbf{O}$, the covariance of every $y_i$ with every $x_j$ is zero, and all corresponding
correlations are likewise zero. Hence, under $H_0$, there is no (linear) relationship
between the y's and the x's, and $H_0$ is equivalent to the statement that all canonical
correlations $r_1, r_2, \ldots, r_s$ are nonsignificant. Furthermore, $H_0$ is equivalent to the
overall regression hypothesis in Section 10.5.1, $H_0$: $\mathbf{B}_1 = \mathbf{O}$, which also relates all
the y's to all the x's. Thus by (7.30) or (10.57), the significance of $r_1, r_2, \ldots, r_s$ can
be tested by

$$\Lambda_1 = \frac{|\mathbf{S}|}{|\mathbf{S}_{yy}||\mathbf{S}_{xx}|} = \frac{|\mathbf{R}|}{|\mathbf{R}_{yy}||\mathbf{R}_{xx}|}, \tag{11.16}$$


which is distributed as $\Lambda_{p,q,n-1-q}$. We reject $H_0$ if $\Lambda_1 < \Lambda_\alpha$. Critical values $\Lambda_\alpha$
are available in Table A.9 using $\nu_H = q$ and $\nu_E = n - 1 - q$. The statistic $\Lambda_1$ in
(11.16) is also distributed as $\Lambda_{q,p,n-1-p}$. As in (7.31), $\Lambda_1$ is expressible in terms of
the squared canonical correlations:

$$\Lambda_1 = \prod_{i=1}^{s}(1 - r_i^2). \tag{11.17}$$

In this form, we can see that if one or more $r_i^2$ is large, $\Lambda_1$ will be small. We have
used the notation $\Lambda_1$ in (11.16) and (11.17) because in Section 11.4.2 we will define
$\Lambda_2$, $\Lambda_3$, and so on to test the significance of $r_2$ and succeeding $r_i$'s after the first.
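The agreement between the determinant forms in (11.16) and the product form in (11.17) can be verified numerically. The sketch below assumes NumPy and synthetic data; the seed, dimensions, and names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(5)           # synthetic data, for illustration only
n, p, q = 60, 2, 3
x = rng.normal(size=(n, q))
y = x @ rng.normal(size=(q, p)) + rng.normal(size=(n, p))
data = np.hstack([y, x])

def wilks_from_det(C):
    """|C| / (|Cyy| |Cxx|) for a partitioned covariance or correlation matrix."""
    return np.linalg.det(C) / (np.linalg.det(C[:p, :p]) * np.linalg.det(C[p:, p:]))

S = np.cov(data, rowvar=False)
R = np.corrcoef(data, rowvar=False)
lam_S = wilks_from_det(S)                # first form in (11.16)
lam_R = wilks_from_det(R)                # second form in (11.16)

# Squared canonical correlations, then the product form (11.17).
M = np.linalg.solve(S[:p, :p], S[:p, p:]) @ np.linalg.solve(S[p:, p:], S[p:, :p])
r2 = np.linalg.eigvals(M).real
lam_r = np.prod(1.0 - r2)
print(lam_S, lam_R, lam_r)
```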

If the parameters exceed the range of critical values for Wilks' $\Lambda$ in Table A.9,
we can use the $\chi^2$-approximation in (6.16),

$$\chi^2 = -\left[n - \tfrac{1}{2}(p + q + 3)\right]\ln \Lambda_1, \tag{11.18}$$

which is approximately distributed as $\chi^2$ with $pq$ degrees of freedom. We reject $H_0$
if $\chi^2 > \chi^2_\alpha$. Alternatively, we can use the $F$-approximation given in (6.15):

$$F = \frac{1 - \Lambda_1^{1/t}}{\Lambda_1^{1/t}}\,\frac{\mathrm{df}_2}{\mathrm{df}_1}, \tag{11.19}$$

which has an approximate $F$-distribution with $\mathrm{df}_1$ and $\mathrm{df}_2$ degrees of freedom, where

$$\mathrm{df}_1 = pq, \qquad \mathrm{df}_2 = wt - \tfrac{1}{2}pq + 1,$$

$$w = n - \tfrac{1}{2}(p + q + 3), \qquad t = \sqrt{\frac{p^2q^2 - 4}{p^2 + q^2 - 5}}.$$

We reject $H_0$ if $F > F_\alpha$. When $pq = 2$, $t$ is set equal to 1. If $s = \min(p, q)$ is equal
to either 1 or 2, then the $F$-approximation in (11.19) has an exact $F$-distribution.
For example, if one of the two sets consists of only two variables, an exact test is
afforded by the $F$-approximation in (11.19). In contrast, the $\chi^2$-approximation in
(11.18) does not reduce to an exact test for any parameter values.

The other three multivariate test statistics in Sections 6.1.4, 6.1.5, and 10.5.1 can
also be used. Pillai's test statistic for the significance of canonical correlations is

$$V^{(s)} = \sum_{i=1}^{s} r_i^2. \tag{11.20}$$

Upper percentage points of $V^{(s)}$ are found in Table A.11, indexed by

$$s = \min(p, q), \qquad m = \tfrac{1}{2}(|q - p| - 1), \qquad N = \tfrac{1}{2}(n - q - p - 2).$$

For $F$-approximations for $V^{(s)}$, see Section 6.1.5.

The Lawley-Hotelling statistic for canonical correlations is

$$U^{(s)} = \sum_{i=1}^{s} \frac{r_i^2}{1 - r_i^2}. \tag{11.21}$$

Upper percentage points for $\nu_E U^{(s)}/\nu_H$ (see Section 6.1.5) are given in Table A.12,
which is entered with $p$, $\nu_H = q$, and $\nu_E = n - q - 1$. For $F$-approximations, see
Section 6.1.5.

Roy's largest root statistic is given by

$$\theta = r_1^2. \tag{11.22}$$

Upper percentage points are found in Table A.10, with $s$, $m$, and $N$ defined as before
for Pillai's test. An "upper bound" on $F$ for Roy's test is given in (6.21). Even though
this upper bound is routinely calculated in many software packages, it is not a valid
approximation.
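As a small worked illustration, the four statistics in (11.17) and (11.20)-(11.22) can be computed directly from the squared canonical correlations reported in Example 11.3. The sketch uses only the Python standard library. Because the inputs are rounded to four decimals, the results agree with values computed from unrounded data only approximately.

```python
import math

# Squared canonical correlations as reported (rounded) in Example 11.3.
r2 = [0.9800, 0.9078, 0.2139]

wilks = math.prod(1 - v for v in r2)              # (11.17)
pillai = sum(r2)                                  # (11.20)
lawley_hotelling = sum(v / (1 - v) for v in r2)   # (11.21)
roy = max(r2)                                     # (11.22)

print(f"Wilks' Lambda    = {wilks:.5f}")
print(f"Pillai V         = {pillai:.4f}")
print(f"Lawley-Hotelling = {lawley_hotelling:.2f}")
print(f"Roy theta        = {roy:.3f}")
```

The Lawley-Hotelling sum is the most sensitive to rounding here, since the first term divides by $1 - r_1^2$, which is close to zero.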

As  noted  at  the  beginning  of  this  section,  the  following  three  tests  are  equivalent: 

1. Test of $H_0$: $\boldsymbol{\Sigma}_{yx} = \mathbf{O}$, independence of two sets of variables.

2. Test of $H_0$: $\mathbf{B}_1 = \mathbf{O}$, significance of overall multivariate multiple regression.

3.  Test  of  significance  of  the  canonical  correlations. 


Even  though  these  tests  are  equivalent,  we  have  discussed  them  separately  because 
each  has  an  extension  that  is  different  from  the  others.  The  respective  extensions  are 

1.  Test  of  independence  of  three  or  more  sets  of  variables  (Section  7.4.2), 

2.  Test  of  full  vs.  reduced  model  in  multivariate  multiple  regression  (Sec¬ 
tion  10.5.2), 

3.  Test  of  significance  of  succeeding  canonical  correlations  after  the  first  (Sec¬ 
tion  11.4.2). 

Example  11.4.1.  For  the  chemical  data  of  Table  10.1,  with  the  extended  set  of  nine 
x’s,  we  obtained  canonical  correlations  .9899,  .9528,  and  .4625  in  Example  11.3. 
To  test  the  significance  of  these,  we  calculate  the  following  four  test  statistics  and 
associated  approximate  F’s. 


  Statistic                          Approximate F    df1    df2      p-Value for F
  Wilks' Λ = .00145                       6.537        27    21.09      < .0001
  Pillai's V(s) = 2.10                    2.340        27    27           .0155
  Lawley-Hotelling U(s) = 59.03          12.388        27    17         < .0001
  Roy's θ = .980                         48.908         9     9         < .0001

The $F$-approximation for Roy's test is, of course, an "upper bound." Rejection
of $H_0$ in these tests implies that at least $r_1^2$ is significantly different from zero. The
question of how many $r_i^2$'s are significant is treated in the next section. □

11.4.2  Test  of  Significance  of  Succeeding  Canonical  Correlations 
after  the  First 

If the test in (11.17) based on all $s$ canonical correlations rejects $H_0$, we are not sure
if the canonical correlations beyond the first are significant. To test the significance
of $r_2, \ldots, r_s$, we delete $r_1^2$ from $\Lambda_1$ in (11.17) to obtain

$$\Lambda_2 = \prod_{i=2}^{s}(1 - r_i^2). \tag{11.23}$$

If this test rejects the hypothesis, we conclude that at least $r_2$ is significantly different
from zero. We can continue in this manner, testing each $r_i$ in turn, until a test fails to
reject the hypothesis. At the $k$th step, the test statistic is

$$\Lambda_k = \prod_{i=k}^{s}(1 - r_i^2), \tag{11.24}$$

which is distributed as $\Lambda_{p-k+1,\,q-k+1,\,n-k-q}$ and tests the significance of $r_k, r_{k+1},
\ldots, r_s$. (These test statistics are analogous to those for discriminant functions in
Section 11.4.2.) Note that each parameter is reduced by $k - 1$ from the parameter
values $p$, $q$, and $n - 1 - q$ for $\Lambda_1$ in (11.16) or (11.17).

The usual $\chi^2$- and $F$-approximations can also be applied to $\Lambda_k$. The
$\chi^2$-approximation analogous to (11.18) is given by

$$\chi^2 = -\left[n - \tfrac{1}{2}(p + q + 3)\right]\ln \Lambda_k, \tag{11.25}$$

which has $(p - k + 1)(q - k + 1)$ degrees of freedom. The $F$-approximation for $\Lambda_k$
is a simple modification of (11.19) and the accompanying parameter definitions. In
place of $p$, $q$, and $n$, we use $p - k + 1$, $q - k + 1$, and $n - k + 1$ to obtain

$$F = \frac{1 - \Lambda_k^{1/t}}{\Lambda_k^{1/t}}\,\frac{\mathrm{df}_2}{\mathrm{df}_1},$$

where

$$\mathrm{df}_1 = (p - k + 1)(q - k + 1),$$
$$\mathrm{df}_2 = wt - \tfrac{1}{2}[(p - k + 1)(q - k + 1)] + 1,$$
$$w = n - \tfrac{1}{2}(p + q + 3),$$
$$t = \sqrt{\frac{(p - k + 1)^2(q - k + 1)^2 - 4}{(p - k + 1)^2 + (q - k + 1)^2 - 5}}.$$

Example 11.4.2. We continue our analysis of the canonical correlations for the
chemical data in Table 10.1 with three y's and nine x's. The tests are summarized in
Table 11.1.

In the case of $\Lambda_2$, we have a discrepancy between the exact Wilks $\Lambda$-test and
the approximate $F$-test. The test based on $\Lambda$ is not significant, whereas the $F$-test
does reach significance. This illustrates the need to check critical values for exact
tests whenever $p$-values for approximate tests are close to the nominal value of $\alpha$.
From the test using $\Lambda$, we conclude that only $r_1 = .9899$ is significant. The relative
sizes of the squared canonical correlations, .980, .908, and .214, would indicate
two dimensions of relationship, but this is not confirmed by the Wilks' test, perhaps
because of the small sample size relative to the number of variables ($p + q = 12$
and $n = 19$).


Table 11.1. Tests of Three Canonical Correlations of the Chemical Data

  k     Λ_k       Λ_.05    Approximate F    df1    df2     p-Value for F
  1    .00145     .024         6.537         27    21.1      < .0001
  2    .0725      .069         2.714         16    16          .0269
  3    .786       .209          .350          7     9          .91

To illustrate the computations, we obtain the values in Table 11.1 for $k = 2$.
Using (11.24), the computation for $\Lambda_2$ is

$$\Lambda_2 = \prod_{i=2}^{3}(1 - r_i^2) = (1 - .908)(1 - .214) = .0725.$$

With $k = 2$, $p = 3$, $q = 9$, and $n = 19$, the critical value for $\Lambda_2$ is obtained from
Table A.9 as

$$\Lambda_{.05,\,p-k+1,\,q-k+1,\,n-k-q} = \Lambda_{.05,2,8,8} = .069.$$

For the approximate $F$ for $\Lambda_2$, we have

$$t = \sqrt{\frac{(3 - 2 + 1)^2(9 - 2 + 1)^2 - 4}{(3 - 2 + 1)^2 + (9 - 2 + 1)^2 - 5}} = 2,$$

$$w = 19 - \tfrac{1}{2}(3 + 9 + 3) = 11.5,$$

$$\mathrm{df}_1 = (3 - 2 + 1)(9 - 2 + 1) = 16,$$

$$\mathrm{df}_2 = (11.5)(2) - \tfrac{1}{2}[(3 - 2 + 1)(9 - 2 + 1)] + 1 = 16,$$

$$F = \frac{1 - (.0725)^{1/2}}{(.0725)^{1/2}}\,\frac{16}{16} = 2.714.$$

□


11.5  INTERPRETATION 

We  now  turn  to  an  assessment  of  the  information  contained  in  the  canonical  correla¬ 
tions  and  canonical  variates.  As  was  done  for  discriminant  functions  in  Section  8.7,  a 
distinction  can  be  made  between  interpretation  of  the  canonical  variates  and  assess¬ 
ing  the  contribution  of  each  variable.  In  the  former,  the  signs  of  the  coefficients  are 
considered;  in  the  latter,  the  signs  are  ignored  and  the  coefficients  are  ranked  in  order 
of  absolute  value. 

In  Sections  11.5.1-11.5.3,  we  discuss  three  common  tools  for  interpretation  of 
canonical  variates:  (1)  standardized  coefficients,  (2)  the  correlation  between  each 
variable and the canonical variate, and (3) rotation of the canonical variate coefficients.
The second of these is the most widely recommended, but we note in Section 11.5.2
that it is the least useful. In fact, for reasons to be outlined, we recommend
only  the  first,  standardized  coefficients.  In  Section  11.5.4,  we  describe  redundancy 
analysis  and  discuss  its  shortcomings  as  a  measure  of  association  between  two  sets 
of  variables. 

11.5.1  Standardized  Coefficients 

The coefficients in the canonical variates $u_i = \mathbf{a}_i'\mathbf{y}$ and $v_i = \mathbf{b}_i'\mathbf{x}$ reflect differences
in scaling of the variables as well as differences in contribution of the variables to
canonical correlation. To remove the effect of scaling, $\mathbf{a}_i$ and $\mathbf{b}_i$ can be standardized
by multiplying by the standard deviations of the corresponding variables as
in (11.14):

$$\mathbf{c}_i = \mathbf{D}_y\mathbf{a}_i, \qquad \mathbf{d}_i = \mathbf{D}_x\mathbf{b}_i,$$

where $\mathbf{D}_y = \mathrm{diag}(s_{y_1}, s_{y_2}, \ldots, s_{y_p})$ and $\mathbf{D}_x = \mathrm{diag}(s_{x_1}, s_{x_2}, \ldots, s_{x_q})$. Alternatively,
$\mathbf{c}_i$ and $\mathbf{d}_i$ can be obtained directly from (11.12) and (11.13) as eigenvectors
of $\mathbf{R}_{yy}^{-1}\mathbf{R}_{yx}\mathbf{R}_{xx}^{-1}\mathbf{R}_{xy}$ and $\mathbf{R}_{xx}^{-1}\mathbf{R}_{xy}\mathbf{R}_{yy}^{-1}\mathbf{R}_{yx}$, respectively. It was noted at the end of
Section 11.2 that the coefficients in $\mathbf{c}_i$ are applied to standardized variables [see
(11.15)]. Thus the effect of differences in size or scaling of the variables is removed,
and the coefficients $c_{i1}, c_{i2}, \ldots, c_{ip}$ in $\mathbf{c}_i$ reflect the relative contribution of each of
$y_1, y_2, \ldots, y_p$ to $u_i$. A similar statement can be made about $\mathbf{d}_i$.

The  standardized  coefficients  show  the  contribution  of  the  variables  in  the  pres¬ 
ence  of  each  other.  Thus  if  some  of  the  variables  are  deleted  and  others  added,  the 
coefficients  will  change.  This  is  precisely  the  behavior  we  desire  from  the  coefficients 
in  a  multivariate  setting. 


Example  11.5.1.  For  the  chemical  data  in  Table  10.1  with  the  extended  set  of  nine 
x’s,  we  obtain  the  following  standardized  coefficients  for  the  three  canonical  variates: 


            c1         c2         c3
  y1      1.5360     4.4704     5.7961
  y2       .2108     2.8291     2.2280
  y3       .4676     3.1309     5.1442

            d1          d2          d3
  x1      5.0125    -38.3053    -12.5072
  x2      5.8551    -17.7390    -24.2290
  x3      1.6500     -7.9699    -32.7392
  x1x2   -3.9209     19.2937     11.6420
  x1x3   -2.2968      6.4001     31.2189
  x2x3     .5316       .8096      1.2988
  x1^2   -2.6655     32.7933      4.8454
  x2^2   -1.2346     -3.3641     10.7979
  x3^2     .5703       .8733       .9706

Thus

$$u_1 = 1.54\,\frac{y_1 - \bar{y}_1}{s_{y_1}} + .21\,\frac{y_2 - \bar{y}_2}{s_{y_2}} + .47\,\frac{y_3 - \bar{y}_3}{s_{y_3}},$$

$$v_1 = 5.01\,\frac{x_1 - \bar{x}_1}{s_{x_1}} + 5.86\,\frac{x_2 - \bar{x}_2}{s_{x_2}} + \cdots + .57\,\frac{x_3^2 - \overline{x_3^2}}{s_{x_3^2}}.$$

The variables that contribute most to the correlation between $u_1$ and $v_1$ are $y_1$ and
$x_1$, $x_2$, $x_1x_2$, $x_1x_3$, $x_1^2$. The correlation between $u_2$ and $v_2$ is due largely to all three
y's and $x_1$, $x_2$, $x_1x_2$, $x_1^2$. □

11.5.2  Correlations  between  Variables  and  Canonical  Variates 

Many writers recommend the additional step of converting the standardized coefficients to correlations. Thus, for example, in $\mathbf{c}_1' = (c_{11}, c_{12}, \ldots, c_{1p})$, instead of the second coefficient $c_{12}$ we could examine $r_{y_2 u_1}$, the correlation between $y_2$ and the first canonical variate $u_1$. Such correlations are sometimes referred to as loadings or structure coefficients, and it is widely claimed that they provide a more valid interpretation of the canonical variates. Rencher (1988; 1992b; 1998, Section 8.6.3) has shown, however, that a weighted sum of the squared correlations between $y_j$ and the canonical variates $u_1, u_2, \ldots, u_s$ is equal to $R^2_{y_j|\mathbf{x}}$, the squared multiple correlation between $y_j$ and the x's. There is no information about how the y's contribute jointly to canonical correlation with the x's. Therefore, the correlations are useless in gauging the importance of a given variable in the context of the others. The researcher who uses these correlations for interpretation is unknowingly reducing the multivariate setting to a univariate one.
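
The univariate nature of these correlations can be checked numerically. The following is a minimal sketch, assuming NumPy and simulated data; the function name `loading_identity` and the data are illustrative assumptions, not part of the original analysis. It takes the weights to be the squared canonical correlations, so that the quantity compared with $R^2_{y_j|\mathbf{x}}$ is $\sum_i r_i^2\, r^2_{y_j u_i}$.

```python
import numpy as np

def loading_identity(Y, X):
    """Return (weighted_sum, R2); the two length-p arrays should agree.

    weighted_sum[j] = sum_i r_i^2 * corr(y_j, u_i)^2
    R2[j]           = squared multiple correlation of y_j regressed on the x's
    """
    n = Y.shape[0]
    Yc = Y - Y.mean(axis=0)
    Xc = X - X.mean(axis=0)
    Syy = Yc.T @ Yc / (n - 1)
    Sxx = Xc.T @ Xc / (n - 1)
    Syx = Yc.T @ Xc / (n - 1)

    # G = S_yx S_xx^{-1} S_xy
    G = Syx @ np.linalg.solve(Sxx, Syx.T)

    # Symmetric version of S_yy^{-1} G a = r^2 a, using the Cholesky factor S_yy = L L'.
    L = np.linalg.cholesky(Syy)
    Linv = np.linalg.inv(L)
    r2, W = np.linalg.eigh(Linv @ G @ Linv.T)   # squared canonical correlations
    A = Linv.T @ W                              # columns a_i, scaled so a_i' S_yy a_i = 1

    # corr(y_j, u_i) = cov(y_j, u_i) / sd(y_j), because var(u_i) = 1.
    sd_y = np.sqrt(np.diag(Syy))
    loadings = (Syy @ A) / sd_y[:, None]        # entry (j, i) is r_{y_j u_i}

    weighted_sum = (loadings ** 2) @ r2
    R2 = np.diag(G) / np.diag(Syy)
    return weighted_sum, R2

# Simulated data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
Y = X[:, :3] @ rng.normal(size=(3, 3)) + rng.normal(size=(100, 3))
weighted_sum, R2 = loading_identity(Y, X)
print(np.allclose(weighted_sum, R2))
```

Each $R^2_{y_j|\mathbf{x}}$ involves only $y_j$ and the x's, which is why the loadings cannot show how the y's act jointly.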


11.5.3  Rotation 

In  an  attempt  to  improve  interpretability,  the  canonical  variate  coefficients  can  be 
rotated  (see  Section  13.5)  to  increase  the  number  of  high  and  low  coefficients  and 
reduce  the  number  of  intermediate  ones. 

We do not recommend rotation of the canonical variate coefficients for two reasons [for proof and further discussion, see Rencher (1992b)]:

1. Rotation destroys the optimality of the canonical correlations. For example, the first canonical correlation is reduced and is no longer equal to $\max_{\mathbf{a},\mathbf{b}} r_{\mathbf{a}'\mathbf{y},\,\mathbf{b}'\mathbf{x}}$ as in (11.4).

2. Rotation introduces correlations among succeeding canonical variates. Thus, for example, $u_1$ and $u_2$ are correlated after rotation. Hence even though the resulting coefficients may offer a subjectively more interpretable pattern, this gain is offset by the increased complexity due to interrelationships among the canonical variates. For example, $u_2$ and $v_2$ no longer offer a new dimension of relationship uncorrelated with $u_1$ and $v_1$. The dimensions now overlap, and some of the information in $u_2$ and $v_2$ is already available in $u_1$ and $v_1$.


11.5.4  Redundancy  Analysis 

The redundancy is a measure of association between the y's and the x's based on the correlations between variables and canonical variates discussed in Section 11.5.2. Since these correlations provide only univariate information, the redundancy turns out to be a univariate rather than a multivariate measure of relationship. If the squared multiple correlation of $y_j$ regressed on the x's is denoted by $R^2_{y_j|\mathbf{x}}$, then the redundancy of the y's given the v's is the average squared multiple correlation:

\[ Rd(\mathbf{y}|\mathbf{v}) = \frac{\sum_{j=1}^{p} R^2_{y_j|\mathbf{x}}}{p}. \tag{11.26} \]

Similarly, the redundancy of the x's given the u's is the average

\[ Rd(\mathbf{x}|\mathbf{u}) = \frac{\sum_{j=1}^{q} R^2_{x_j|\mathbf{y}}}{q}, \tag{11.27} \]

where $R^2_{x_j|\mathbf{y}}$ is the squared multiple correlation of $x_j$ regressed on the y's. Since $Rd(\mathbf{y}|\mathbf{v})$ in (11.26) is the average squared multiple correlation of each $y_j$ regressed on the x's, it does not take into account the correlations among the y's. It is thus an average univariate measure of relationship between the y's and the x's, not a multivariate measure at all. The two redundancy measures in (11.26) and (11.27) are not symmetric; that is, $Rd(\mathbf{y}|\mathbf{v}) \neq Rd(\mathbf{x}|\mathbf{u})$.

Thus the so-called redundancy does not really quantify the redundancy among the y's and x's and is, therefore, not a useful measure of association between two sets of variables. For a measure of association we recommend $r_1^2$ itself.


11.6  RELATIONSHIPS  OF  CANONICAL  CORRELATION  ANALYSIS 
TO  OTHER  MULTIVARIATE  TECHNIQUES 

In Section 11.4.1, we noted the equivalence of the test for significance of the canonical correlations and the test for significance of overall regression, $H_0: \mathbf{B}_1 = \mathbf{O}$. Additional relationships between canonical correlation and multivariate regression are developed in Section 11.6.1. The relationship of canonical correlation analysis to MANOVA and discriminant analysis is discussed in Section 11.6.2.


11.6.1  Regression 

There is a direct link between canonical variate coefficients and multivariate multiple regression coefficients. The matrix of regression coefficients of the y's regressed on the x's (corrected for their means) is given in (10.52) as $\hat{\mathbf{B}}_1 = \mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}$. This matrix can be used to relate $\mathbf{a}_i$ and $\mathbf{b}_i$:

\[ \mathbf{b}_i = \hat{\mathbf{B}}_1\mathbf{a}_i. \tag{11.28} \]

[Since $\mathbf{a}_i$ and $\mathbf{b}_i$ are eigenvectors, (11.28) could also be written as $\mathbf{b}_i = c\hat{\mathbf{B}}_1\mathbf{a}_i$, where $c$ is an arbitrary scale factor.] By (2.67) and (11.28), the canonical variate coefficient vector $\mathbf{b}_i$ is expressible as a linear combination of the columns of $\hat{\mathbf{B}}_1$. A similar expression for $\mathbf{a}_i$ can be obtained from the regression of x on y: $\mathbf{a}_i = \mathbf{S}_{yy}^{-1}\mathbf{S}_{yx}\mathbf{b}_i$.
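
The relation in (11.28) can be illustrated numerically. The sketch below assumes NumPy and simulated data (the sample sizes, the seed, and the helper name `top_eigvec` are illustrative assumptions). It finds $\mathbf{a}_1$ and $\mathbf{b}_1$ from their separate eigenproblems and checks that $\mathbf{b}_1$ is proportional to $\hat{\mathbf{B}}_1\mathbf{a}_1$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 200, 2, 3                                   # hypothetical sizes
X = rng.normal(size=(n, q))
Y = X @ rng.normal(size=(q, p)) + rng.normal(size=(n, p))

Yc, Xc = Y - Y.mean(axis=0), X - X.mean(axis=0)
Syy = Yc.T @ Yc / (n - 1)
Sxx = Xc.T @ Xc / (n - 1)
Sxy = Xc.T @ Yc / (n - 1)                             # q x p

B1 = np.linalg.solve(Sxx, Sxy)                        # S_xx^{-1} S_xy, q x p
Syy_inv_Syx = np.linalg.solve(Syy, Sxy.T)             # S_yy^{-1} S_yx, p x q
My = Syy_inv_Syx @ B1                                 # eigenvectors are the a_i
Mx = B1 @ Syy_inv_Syx                                 # eigenvectors are the b_i

def top_eigvec(M):
    """Largest eigenvalue of M and its eigenvector (real parts)."""
    vals, vecs = np.linalg.eig(M)
    i = np.argmax(vals.real)
    return vals.real[i], vecs[:, i].real

r2_from_y, a1 = top_eigvec(My)
r2_from_x, b1 = top_eigvec(Mx)

# Eigenvectors are defined only up to scale, so compare directions.
b_from_a = B1 @ a1
cosine = abs(b1 @ b_from_a) / (np.linalg.norm(b1) * np.linalg.norm(b_from_a))
print(np.isclose(r2_from_y, r2_from_x), np.isclose(cosine, 1.0))
```

The comparison uses the cosine of the angle between the two vectors because of the arbitrary scale factor $c$ noted above.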

In Section 11.2, canonical correlation was defined as an extension of multiple correlation. Correspondingly, canonical correlation reduces to multiple correlation when one of the two sets of variables has only one variable. When $p = 1$, for example, $\mathbf{R}_{yy}$ becomes 1, and by (11.10), the single squared canonical correlation reduces to $r^2 = \mathbf{r}_{yx}'\mathbf{R}_{xx}^{-1}\mathbf{r}_{yx}$, which we recognize from (10.34) as $R^2$.

The two Wilks' test statistics in multivariate regression in Sections 10.5.1 and 10.5.2, namely, the test for overall regression and the test on a subset of the x's, can both be expressed in terms of canonical correlations. By (10.55) and (11.17), the test statistic for the overall regression hypothesis $H_0: \mathbf{B}_1 = \mathbf{O}$ can be written as

\[ \Lambda_f = \frac{|\mathbf{Y}'\mathbf{Y} - \hat{\mathbf{B}}'\mathbf{X}'\mathbf{Y}|}{|\mathbf{Y}'\mathbf{Y} - n\bar{\mathbf{y}}\bar{\mathbf{y}}'|} \tag{11.29} \]

\[ \phantom{\Lambda_f} = \prod_{i=1}^{s}(1 - r_i^2), \tag{11.30} \]

where $r_i^2$ is the $i$th squared canonical correlation.

A test statistic for $H_0: \mathbf{B}_d = \mathbf{O}$, the hypothesis that the y's do not depend on the last $h$ of the x's, is given by (10.65) as

\[ \Lambda(x_{q-h+1}, \ldots, x_q \mid x_1, \ldots, x_{q-h}) = \frac{\Lambda_f}{\Lambda_r}, \tag{11.31} \]

where $\Lambda_f$ is given in (11.29) and $\Lambda_r$ is given in (10.64) as

\[ \Lambda_r = \frac{|\mathbf{Y}'\mathbf{Y} - \hat{\mathbf{B}}_r'\mathbf{X}_r'\mathbf{Y}|}{|\mathbf{Y}'\mathbf{Y} - n\bar{\mathbf{y}}\bar{\mathbf{y}}'|}. \tag{11.32} \]

By analogy with (11.30), $\Lambda_r$ can be expressed in terms of the squared canonical correlations $c_1^2, c_2^2, \ldots, c_t^2$ between $y_1, y_2, \ldots, y_p$ and $x_1, x_2, \ldots, x_{q-h}$:

\[ \Lambda_r = \prod_{i=1}^{t}(1 - c_i^2), \tag{11.33} \]

where $t = \min(p, q - h)$. We have used the notation $c_i^2$ instead of $r_i^2$ to emphasize that the canonical correlations in the reduced model differ from those in the full model. By (11.30) and (11.33), the full and reduced model test of $H_0: \mathbf{B}_d = \mathbf{O}$ in (11.31) can now be expressed in terms of canonical correlations as

\[ \Lambda(x_{q-h+1}, \ldots, x_q \mid x_1, \ldots, x_{q-h}) = \frac{\prod_{i=1}^{s}(1 - r_i^2)}{\prod_{i=1}^{t}(1 - c_i^2)}. \tag{11.34} \]

If $p = 1$, as in multiple regression, then $s = t = 1$, and (11.34) reduces to

\[ \Lambda = \frac{1 - R_f^2}{1 - R_r^2}, \tag{11.35} \]

where $R_f^2$ and $R_r^2$ are the squared multiple correlations for the full model and for the reduced model. The distribution of $\Lambda$ in (11.35) is $\Lambda_{1,h,n-q-1}$ when $H_0$ is true. Since $p = 1$, there is an exact F-transformation from Table 6.1,

\[ F = \frac{(1 - \Lambda)(n - q - 1)}{\Lambda h}, \]

which is distributed as $F_{h,n-q-1}$ when $H_0$ is true. Substitution of $\Lambda = (1 - R_f^2)/(1 - R_r^2)$ from (11.35) yields the F-statistic expressed in terms of $R^2$,

\[ F = \frac{(R_f^2 - R_r^2)(n - q - 1)}{(1 - R_f^2)h}, \tag{11.36} \]

as given in (10.33).
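
The algebra leading from (11.34) to (11.36) can be traced with a few lines of code. This is only a sketch under stated assumptions: the function names are hypothetical, and the values of $R_f^2$, $R_r^2$, $n$, $q$, and $h$ are made-up numbers chosen to illustrate the formulas, not results from any data set in the text.

```python
import numpy as np

def wilks_from_canonical(sq_corrs):
    """Lambda = prod(1 - r_i^2), the form used in (11.30) and (11.33)."""
    return float(np.prod(1.0 - np.asarray(sq_corrs, dtype=float)))

def partial_wilks(full_sq_corrs, reduced_sq_corrs):
    """Full-versus-reduced Lambda of (11.34)."""
    return wilks_from_canonical(full_sq_corrs) / wilks_from_canonical(reduced_sq_corrs)

def f_from_lambda(lam, n, q, h):
    """Exact F when p = 1: (1 - Lambda)(n - q - 1) / (Lambda h)."""
    return (1.0 - lam) * (n - q - 1) / (lam * h)

def f_from_r2(r2_full, r2_reduced, n, q, h):
    """The same F written in terms of R^2, as in (11.36)."""
    return (r2_full - r2_reduced) * (n - q - 1) / ((1.0 - r2_full) * h)

# Hypothetical values for the case p = 1 (so s = t = 1).
r2_full, r2_reduced, n, q, h = 0.60, 0.45, 30, 5, 2
lam = partial_wilks([r2_full], [r2_reduced])          # (11.35)
print(lam, f_from_lambda(lam, n, q, h), f_from_r2(r2_full, r2_reduced, n, q, h))
```

The two F values agree because $(1 - \Lambda)/\Lambda = (R_f^2 - R_r^2)/(1 - R_f^2)$ when $\Lambda$ has the form (11.35).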

Subset selection in canonical correlation analysis can be handled by the methods for multivariate regression given in Section 10.7. A subset of x's can be found by the procedure of Section 10.7.1a. After a subset of x's is found, the approach in Section 10.7.1b can be used to select a subset of y's.

Muller (1982) discussed the relationship of canonical correlation analysis to multivariate regression and principal components. (Principal components are treated in Chapter 12.)

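11.6.2 MANOVA and Discriminant Analysis

In Sections 6.1.8 and 8.4.2, it was noted that in a one-way MANOVA or discriminant analysis setting, $\lambda_i/(1 + \lambda_i)$ is equal to $r_i^2$, where $\lambda_i$ is the $i$th eigenvalue of $\mathbf{E}^{-1}\mathbf{H}$ and $r_i^2$ is the $i$th squared canonical correlation between the $p$ dependent variables and the $k - 1$ grouping variables. We now give a justification of this assertion.

Let the dependent variables be denoted by $y_1, y_2, \ldots, y_p$, as usual. We represent the $k$ groups by $k - 1$ dummy variables, $x_1, x_2, \ldots, x_{k-1}$, defined for each member of the $i$th group, $i \le k - 1$, as $x_1 = 0, \ldots, x_{i-1} = 0$, $x_i = 1$, $x_{i+1} = 0, \ldots, x_{k-1} = 0$. For the $k$th group, all x's are zero. (See Section 6.1.8 for an introduction to dummy variables.) To illustrate with $k = 4$, the x's are defined as follows in each group:

    Group    x_1    x_2    x_3
      1       1      0      0
      2       0      1      0
      3       0      0      1
      4       0      0      0

The MANOVA model is equivalent to multivariate regression of $y_1, y_2, \ldots, y_p$ on the dummy grouping variables $x_1, x_2, \ldots, x_{k-1}$. The MANOVA test of $H_0: \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2 = \cdots = \boldsymbol{\mu}_k$ is equivalent to the multivariate regression test of $H_0: \mathbf{B}_1 = \mathbf{O}$, as given by (11.17),

\[ \Lambda = \prod_{i=1}^{s}(1 - r_i^2). \tag{11.37} \]

When we compare this form of $\Lambda$ to the MANOVA test statistic (6.14),

\[ \Lambda = \prod_{i=1}^{s}\frac{1}{1 + \lambda_i}, \tag{11.38} \]

we obtain the relationships

\[ r_i^2 = \frac{\lambda_i}{1 + \lambda_i}, \tag{11.39} \]

\[ \lambda_i = \frac{r_i^2}{1 - r_i^2}. \tag{11.40} \]

To establish this relationship more formally, we write (6.22) as

\[ \mathbf{H}\mathbf{a} = \lambda\mathbf{E}\mathbf{a} \tag{11.41} \]

and (11.7) as

\[ \mathbf{S}_{yy}^{-1}\mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}\mathbf{a} = r^2\mathbf{a}. \tag{11.42} \]

We multiply (11.42) on the left by $\mathbf{S}_{yy}$ to obtain

\[ \mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}\mathbf{a} = r^2\mathbf{S}_{yy}\mathbf{a}. \tag{11.43} \]

Using the centered matrix $\mathbf{X}_c$ in (10.14), with an analogous definition for $\mathbf{Y}_c$, we can write $\hat{\mathbf{B}}_1$ in the form [see (10.52)]

\[ \hat{\mathbf{B}}_1 = \left(\frac{\mathbf{X}_c'\mathbf{X}_c}{n - 1}\right)^{-1}\frac{\mathbf{X}_c'\mathbf{Y}_c}{n - 1} = \mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}. \]

In terms of centered matrices, $\mathbf{E} = \mathbf{Y}'\mathbf{Y} - \hat{\mathbf{B}}'\mathbf{X}'\mathbf{Y}$ in (10.49) can be written as

\[ \frac{\mathbf{E}}{n - 1} = \frac{\mathbf{Y}_c'\mathbf{Y}_c}{n - 1} - \frac{\hat{\mathbf{B}}_1'\mathbf{X}_c'\mathbf{Y}_c}{n - 1} = \mathbf{S}_{yy} - \mathbf{S}_{xy}'\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy} = \mathbf{S}_{yy} - \mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}, \tag{11.44} \]

since $\mathbf{S}_{xy}' = \mathbf{S}_{yx}$. Similarly,

\[ \frac{\mathbf{H}}{n - 1} = \frac{\hat{\mathbf{B}}_1'\mathbf{X}_c'\mathbf{Y}_c}{n - 1} = \mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}. \tag{11.45} \]

Since MANOVA is equivalent to multivariate regression on dummy grouping variables, we can substitute these values of $\mathbf{E}$ and $\mathbf{H}$ into (11.41) to obtain

\[ \mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}\mathbf{a} = \lambda(\mathbf{S}_{yy} - \mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy})\mathbf{a}. \tag{11.46} \]

Subtracting $r^2\mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}\mathbf{a}$ from both sides of (11.43) gives

\[ \mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}\mathbf{a} = \frac{r^2}{1 - r^2}(\mathbf{S}_{yy} - \mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy})\mathbf{a}. \tag{11.47} \]

A comparison of (11.46) and (11.47) shows that

\[ \lambda = \frac{r^2}{1 - r^2}, \]

as in (11.40). Lindsey, Webster, and Halpern (1985) discussed some advantages of using canonical correlation analysis in place of discriminant analysis in the several-group case.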
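
The relationships (11.39) and (11.40) can also be confirmed on simulated data. The sketch below is illustrative only: it assumes NumPy, a hypothetical balanced design with four groups, and an arbitrary group effect. It computes the eigenvalues of $\mathbf{E}^{-1}\mathbf{H}$ from the one-way MANOVA and the squared canonical correlations between the y's and the $k - 1$ dummy variables, then compares them.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n_per, p = 4, 10, 3                        # hypothetical: 4 groups, 10 obs each, 3 y's
groups = np.repeat(np.arange(k), n_per)
shift = np.array([0.5, -0.3, 0.2])            # arbitrary group effect, illustration only
Y = rng.normal(size=(k * n_per, p)) + groups[:, None] * shift

# k - 1 dummy variables: x_g = 1 in group g; the last group has all x's equal to zero.
X = np.zeros((k * n_per, k - 1))
for g in range(k - 1):
    X[groups == g, g] = 1.0

# One-way MANOVA hypothesis (between) and error (within) matrices.
grand_mean = Y.mean(axis=0)
H = np.zeros((p, p))
E = np.zeros((p, p))
for g in range(k):
    Yg = Y[groups == g]
    d = (Yg.mean(axis=0) - grand_mean)[:, None]
    H += len(Yg) * (d @ d.T)
    resid = Yg - Yg.mean(axis=0)
    E += resid.T @ resid
lam = np.sort(np.linalg.eigvals(np.linalg.solve(E, H)).real)[::-1]

# Squared canonical correlations between the y's and the dummy x's.
Yc, Xc = Y - Y.mean(axis=0), X - X.mean(axis=0)
Syy, Sxx, Syx = Yc.T @ Yc, Xc.T @ Xc, Yc.T @ Xc      # the factor 1/(n - 1) cancels
M = np.linalg.solve(Syy, Syx @ np.linalg.solve(Sxx, Syx.T))
r2 = np.sort(np.linalg.eigvals(M).real)[::-1]

print(np.allclose(r2, lam / (1 + lam)))               # relation (11.39)
print(np.allclose(lam, r2 / (1 - r2)))                # relation (11.40)
print(np.isclose(np.prod(1 - r2), np.prod(1 / (1 + lam))))   # (11.37) versus (11.38)
```

Here $s = \min(p, k - 1) = 3$, so both eigenproblems yield three values.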


PROBLEMS 

11.1 Show that the expression for canonical correlations in (11.12) can be obtained from the analogous expression in terms of variances and covariances in (11.7).

11.2 Verify (11.28), $\mathbf{b}_i = \hat{\mathbf{B}}_1\mathbf{a}_i$.

11.3 Verify (11.35) for $\Lambda$ when $p = s = t = 1$.

11.4 Verify the expression in (11.36) for $F$ in terms of $R_f^2$ and $R_r^2$.

11.5 Solve (11.39), $r_i^2 = \lambda_i/(1 + \lambda_i)$, for $\lambda_i$ to obtain (11.40).

11.6 Verify (11.46), $\mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}\mathbf{a} = \lambda(\mathbf{S}_{yy} - \mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy})\mathbf{a}$.

11.7 Show that (11.47) can be obtained by subtracting $r^2\mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}\mathbf{a}$ from both sides of (11.43).

11.8 Use the diabetes data of Table 3.4.

(a) Find the canonical correlations between $(y_1, y_2)$ and $(x_1, x_2, x_3)$.

(b) Find the standardized coefficients for the canonical variates.

(c) Test the significance of each canonical correlation.

11.9 Use the sons data of Table 3.7.

(a) Find the canonical correlations between $(y_1, y_2)$ and $(x_1, x_2)$.

(b) Find the standardized coefficients for the canonical variates.

(c) Test the significance of each canonical correlation.

11.10 Use the glucose data of Table 3.8.

(a) Find the canonical correlations between $(y_1, y_2, y_3)$ and $(x_1, x_2, x_3)$.

(b) Find the standardized coefficients for the canonical variates.

(c) Test the significance of each canonical correlation.

11.11 Use the Seishu data of Table 7.1.

(a) Find the canonical correlations between $(y_1, y_2)$ and $(x_1, x_2, \ldots, x_8)$.

(b) Find the standardized coefficients for the canonical variates.

(c) Test the significance of each canonical correlation.

11.12 Use canonical correlation to carry out the tests in parts (b), (c), and (d) of Problem 10.17, using the Seishu data. You will need to find the canonical correlations between $(y_1, y_2)$ and the x's in the indicated reduced models and use (11.34).

11.13 Using the temperature data of Table 7.2, find the canonical correlations and the standardized coefficients and carry out significance tests for the following:

(a) $(y_1, y_2, y_3)$ and $(y_4, y_5, y_6)$

(b) $(y_1, y_2, \ldots, y_6)$ and $(y_7, y_8, y_9)$

(c) $(y_1, y_2, \ldots, y_9)$ and $(y_{10}, y_{11})$

(d) $(y_1, y_2, \ldots, y_6)$ and $(y_7, y_8, \ldots, y_{11})$.

Principal Component Analysis


12.1  INTRODUCTION 

In principal component analysis, we seek to maximize the variance of a linear combination of the variables. For example, we might want to rank students on the basis of their scores on achievement tests in English, mathematics, reading, and so on. An average score would provide a single scale on which to compare the students, but with unequal weights we can spread the students out further on the scale and obtain a better ranking.

Essentially, principal component analysis is a one-sample technique applied to data with no groupings among the observations as in Chapters 8 and 9 and no partitioning of the variables into subsets y and x, as in Chapters 10 and 11. All the linear combinations that we have considered previously were related to other variables or to the data structure. In regression, we have linear combinations of the independent variables that best predict the dependent variable(s); in canonical correlation, we have linear combinations of a subset of variables that maximally correlate with linear combinations of another subset of variables; and discriminant analysis involves linear combinations that maximally separate groups of observations. Principal components, on the other hand, are concerned only with the core structure of a single sample of observations on $p$ variables. None of the variables is designated as dependent, and no grouping of observations is assumed. [For a discussion of the use of principal components with data consisting of several samples or groups, see Rencher (1998, Section 9.9).]

The first principal component is the linear combination with maximal variance; we are essentially searching for a dimension along which the observations are maximally separated or spread out. The second principal component is the linear combination with maximal variance in a direction orthogonal to the first principal component, and so on. In general, the principal components define different dimensions from those defined by discriminant functions or canonical variates.

In some applications, the principal components are an end in themselves and may be amenable to interpretation. More often they are obtained for use as input to another analysis. For example, two situations in regression where principal components may be useful are (1) if the number of independent variables is large relative to the number of observations, a test may be ineffective or even impossible, and (2) if the independent variables are highly correlated, the estimates of regression coefficients may be unstable. In such cases, the independent variables can be reduced to a smaller number of principal components that will yield a better test or more stable estimates of the regression coefficients. For details of this application, see Rencher (1998, Section 9.8).

As another illustration, suppose that in a MANOVA application $p$ is close to $\nu_E$, so that a test has low power, or that $p > \nu_E$, in which case we have so many dependent variables that a test cannot be made. In such cases, we can replace the dependent variables with a smaller set of principal components and then carry out the test.

In these illustrations, principal components are used to reduce the number of dimensions. Another useful dimension reduction device is to evaluate the first two principal components for each observation vector and construct a scatter plot to check for multivariate normality, outliers, and so on.

Finally, we note that in the term principal components, we use the adjective principal, describing what kind of components: main, primary, fundamental, major, and so on. We do not use the noun principle as a modifier for components.


12.2  GEOMETRIC  AND  ALGEBRAIC  BASES 
OF  PRINCIPAL  COMPONENTS 

12.2.1  Geometric  Approach 

As noted in Section 12.1, principal component analysis deals with a single sample of $n$ observation vectors $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n$ that form a swarm of points in a $p$-dimensional space. Principal component analysis can be applied to any distribution of $\mathbf{y}$, but it will be easier to visualize geometrically if the swarm of points is ellipsoidal.

If the variables $y_1, y_2, \ldots, y_p$ in $\mathbf{y}$ are correlated, the ellipsoidal swarm of points is not oriented parallel to any of the axes represented by $y_1, y_2, \ldots, y_p$. We wish to find the natural axes of the swarm of points (the axes of the ellipsoid) with origin at $\bar{\mathbf{y}}$, the mean vector of $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n$. This is done by translating the origin to $\bar{\mathbf{y}}$ and then rotating the axes. After rotation so that the axes become the natural axes of the ellipsoid, the new variables (principal components) will be uncorrelated.

We could indicate the translation of the origin to $\bar{\mathbf{y}}$ by writing $\mathbf{y}_i - \bar{\mathbf{y}}$, but we will not usually do so for economy of notation. We will write $\mathbf{y}_i - \bar{\mathbf{y}}$ when there is an explicit need; otherwise we assume that $\mathbf{y}_i$ has been centered.

The axes can be rotated by multiplying each $\mathbf{y}_i$ by an orthogonal matrix $\mathbf{A}$ [see (2.101), where the orthogonal matrix was denoted by $\mathbf{C}$]:

\[ \mathbf{z}_i = \mathbf{A}\mathbf{y}_i. \tag{12.1} \]

Since $\mathbf{A}$ is orthogonal, $\mathbf{A}'\mathbf{A} = \mathbf{I}$, and the distance to the origin is unchanged:

\[ \mathbf{z}_i'\mathbf{z}_i = (\mathbf{A}\mathbf{y}_i)'(\mathbf{A}\mathbf{y}_i) = \mathbf{y}_i'\mathbf{A}'\mathbf{A}\mathbf{y}_i = \mathbf{y}_i'\mathbf{y}_i \]

[see (2.103)]. Thus an orthogonal matrix transforms $\mathbf{y}_i$ to a point $\mathbf{z}_i$ that is the same distance from the origin, and the axes are effectively rotated.

Finding the axes of the ellipsoid is equivalent to finding the orthogonal matrix $\mathbf{A}$ that rotates the axes to line up with the natural extensions of the swarm of points so that the new variables (principal components) $z_1, z_2, \ldots, z_p$ in $\mathbf{z} = \mathbf{A}\mathbf{y}$ are uncorrelated. Thus we want the sample covariance matrix of $\mathbf{z}$, $\mathbf{S}_z = \mathbf{A}\mathbf{S}\mathbf{A}'$ [see (3.64)], to be diagonal:

\[ \mathbf{S}_z = \mathbf{A}\mathbf{S}\mathbf{A}' = \begin{pmatrix} s_{z_1}^2 & 0 & \cdots & 0 \\ 0 & s_{z_2}^2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & s_{z_p}^2 \end{pmatrix}, \tag{12.2} \]

where $\mathbf{S}$ is the sample covariance matrix of $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n$. By (2.111), $\mathbf{C}'\mathbf{S}\mathbf{C} = \mathbf{D} = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p)$, where the $\lambda_i$'s are eigenvalues of $\mathbf{S}$ and $\mathbf{C}$ is an orthogonal matrix whose columns are normalized eigenvectors of $\mathbf{S}$. Thus the orthogonal matrix $\mathbf{A}$ that diagonalizes $\mathbf{S}$ is the transpose of the matrix $\mathbf{C}$:

\[ \mathbf{A} = \mathbf{C}' = \begin{pmatrix} \mathbf{a}_1' \\ \mathbf{a}_2' \\ \vdots \\ \mathbf{a}_p' \end{pmatrix}, \tag{12.3} \]

where $\mathbf{a}_i$ is the $i$th normalized ($\mathbf{a}_i'\mathbf{a}_i = 1$) eigenvector of $\mathbf{S}$. The principal components are the transformed variables $z_1 = \mathbf{a}_1'\mathbf{y}$, $z_2 = \mathbf{a}_2'\mathbf{y}$, $\ldots$, $z_p = \mathbf{a}_p'\mathbf{y}$ in $\mathbf{z} = \mathbf{A}\mathbf{y}$. For example, $z_1 = a_{11}y_1 + a_{12}y_2 + \cdots + a_{1p}y_p$.

By (2.111), the diagonal elements of $\mathbf{A}\mathbf{S}\mathbf{A}'$ on the right side of (12.2) are eigenvalues of $\mathbf{S}$. Hence the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_p$ of $\mathbf{S}$ are the (sample) variances of the principal components $z_i = \mathbf{a}_i'\mathbf{y}$:

\[ s_{z_i}^2 = \lambda_i. \tag{12.4} \]

Since the rotation lines up with the natural extensions of the swarm of points, $z_1 = \mathbf{a}_1'\mathbf{y}$ has the largest (sample) variance and $z_p = \mathbf{a}_p'\mathbf{y}$ has the smallest variance. This also follows from (12.4), because the variance of $z_1$ is $\lambda_1$, the largest eigenvalue, and the variance of $z_p$ is $\lambda_p$, the smallest eigenvalue. If some of the eigenvalues are small, we can neglect them and represent the points fairly well with fewer than $p$ dimensions. For example, if $p = 3$ and $\lambda_3$ is small, then the swarm of points is an “elliptical pancake,” and a two-dimensional representation will adequately portray the configuration of points.
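
The rotation in (12.1) through (12.4) can be sketched in a few lines. The example assumes NumPy and simulated correlated data (the mixing matrix `mix` and the seed are arbitrary choices made for illustration). It forms $\mathbf{A} = \mathbf{C}'$ from the eigenvectors of $\mathbf{S}$ and checks the three properties stated above: $\mathbf{A}$ is orthogonal, distances to the origin are unchanged, and $\mathbf{S}_z$ is diagonal with the eigenvalues on the diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 3
mix = np.array([[2.0, 0.0, 0.0],
                [1.2, 1.0, 0.0],
                [0.5, 0.3, 0.4]])            # arbitrary mixing matrix to induce correlation
Y = rng.normal(size=(n, p)) @ mix.T
Yc = Y - Y.mean(axis=0)                      # translate the origin to the mean vector
S = np.cov(Yc, rowvar=False)                 # sample covariance matrix (divisor n - 1)

lam, C = np.linalg.eigh(S)                   # eigenvalues in ascending order
order = np.argsort(lam)[::-1]                # reorder so that lambda_1 is the largest
lam, C = lam[order], C[:, order]

A = C.T                                      # rows of A are the eigenvectors a_i'
Z = Yc @ A.T                                 # row i of Z is z_i' = (A y_i)'
Sz = np.cov(Z, rowvar=False)

print(np.allclose(A @ A.T, np.eye(p)))                              # A is orthogonal
print(np.allclose((Z ** 2).sum(axis=1), (Yc ** 2).sum(axis=1)))    # distances unchanged
print(np.allclose(Sz, np.diag(lam)))                                # S_z = diag(lambda_i)
```

`np.linalg.eigh` is used because $\mathbf{S}$ is symmetric; it returns eigenvalues in ascending order, hence the explicit reordering.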

Because the eigenvalues are variances of the principal components, we can speak of “the proportion of variance explained” by the first $k$ components:

\[ \text{Proportion of variance} = \frac{\lambda_1 + \lambda_2 + \cdots + \lambda_k}{\lambda_1 + \lambda_2 + \cdots + \lambda_p} = \frac{\lambda_1 + \lambda_2 + \cdots + \lambda_k}{\sum_{j=1}^{p} s_{jj}}, \tag{12.5} \]

since $\sum_{i=1}^{p}\lambda_i = \mathrm{tr}(\mathbf{S})$ by (2.107). Thus we try to represent the $p$-dimensional points $(y_{i1}, y_{i2}, \ldots, y_{ip})$ with a few principal components $(z_{i1}, z_{i2}, \ldots, z_{ik})$ that account for a large proportion of the total variance. If a few variables have relatively large variances, they will figure disproportionately in $\sum_j s_{jj}$ and in the principal components. For example, if $s_{22}$ is strikingly larger than the other variances, then in $z_1 = a_{11}y_1 + a_{12}y_2 + \cdots + a_{1p}y_p$, the coefficient $a_{12}$ will be large and all other $a_{1j}$ will be small.

When a ratio analogous to (12.5) is used for discriminant functions and canonical variates [see (8.13) and (11.9)], it is frequently referred to as percent of variance. However, in the case of discriminant functions and canonical variates, the eigenvalues are not variances, as they are in principal components.

If the variables are highly correlated, the essential dimensionality is much smaller than $p$. In this case, the first few eigenvalues will be large, and (12.5) will be close to 1 for a small value of $k$. On the other hand, if the correlations among the variables are all small, the dimensionality is close to $p$ and the eigenvalues will be nearly equal. In this case, the principal components essentially duplicate the variables, and no useful reduction in dimension is achieved.

Any two principal components $z_i = \mathbf{a}_i'\mathbf{y}$ and $z_j = \mathbf{a}_j'\mathbf{y}$ are orthogonal for $i \neq j$; that is, $\mathbf{a}_i'\mathbf{a}_j = 0$, because $\mathbf{a}_i$ and $\mathbf{a}_j$ are eigenvectors of the symmetric matrix $\mathbf{S}$ (see Section 2.11.6). Principal components also have the secondary property of being uncorrelated in the sample [see (12.2) and (3.63)]; that is, the covariance of $z_i$ and $z_j$ is zero:

\[ s_{z_i z_j} = \mathbf{a}_i'\mathbf{S}\mathbf{a}_j = 0 \quad \text{for } i \neq j. \tag{12.6} \]

Discriminant functions and canonical variates, on the other hand, have the weaker property of being uncorrelated but not the stronger property of orthogonality. Thus when we plot the first two discriminant functions or canonical variates on perpendicular coordinate axes, there is some distortion of their true relationship because the actual angle between their axes is not 90°.

If we change the scale on one or more of the y's, the shape of the swarm of points will change, and we will need different components to represent the new points. Hence the principal components are not scale invariant. We therefore need to be concerned with the units in which the variables are measured. If possible, all variables should be expressed in the same units. If the variables have widely disparate variances, we could standardize them before extracting eigenvalues and eigenvectors. This is equivalent to finding principal components of the correlation matrix $\mathbf{R}$ and is treated in Section 12.5.

If one variable has a much greater variance than the other variables, the swarm of points will be elongated and will be nearly parallel to the axis corresponding to the variable with large variance. The first principal component will largely represent that variable, and the other principal components will have negligibly small variances. Such principal components (based on $\mathbf{S}$) do not involve the other $p - 1$ variables, and we may prefer to analyze the correlation matrix $\mathbf{R}$.

Example 12.2.1. To illustrate principal components as a rotation when $p = 2$, we use two variables from the sons data of Table 3.7: $y_1$ is head length and $y_2$ is head width for the first son. The mean vector and covariance matrix are

\[ \bar{\mathbf{y}} = \begin{pmatrix} 185.7 \\ 151.1 \end{pmatrix}, \qquad \mathbf{S} = \begin{pmatrix} 95.29 & 52.87 \\ 52.87 & 54.36 \end{pmatrix}. \]

The eigenvalues and eigenvectors of $\mathbf{S}$ are

\[ \lambda_1 = 131.52, \qquad \lambda_2 = 18.14, \]

\[ \mathbf{a}_1' = (a_{11}, a_{12}) = (.825, .565), \qquad \mathbf{a}_2' = (a_{21}, a_{22}) = (-.565, .825). \]

The symmetric pattern in the eigenvectors is due to their orthogonality: $\mathbf{a}_1'\mathbf{a}_2 = a_{11}a_{21} + a_{12}a_{22} = 0$.

The observations are plotted in Figure 12.1, along with the (translated and) rotated axes. The major axis is the line passing through $\bar{\mathbf{y}}' = (185.7, 151.1)$ in the direction determined by $\mathbf{a}_1' = (.825, .565)$; the slope is $a_{12}/a_{11} = .565/.825$. Alternatively, the equation of the major axis can be obtained by setting $z_2 = 0$:

\[ z_2 = 0 = a_{21}(y_1 - \bar{y}_1) + a_{22}(y_2 - \bar{y}_2) = -.565(y_1 - 185.7) + .825(y_2 - 151.1). \]

Figure 12.1. Principal component transformation for the sons data.

The lengths of the semimajor and semiminor axes are proportional to $\sqrt{\lambda_1} = 11.5$ and $\sqrt{\lambda_2} = 4.3$, respectively.

Note that the line formed by the major axis can be considered to be a regression line. It is fit to the points so that the perpendicular distance of the points to the line is minimized, rather than the usual vertical distance (see Section 12.3). □
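
The figures in Example 12.2.1 can be reproduced from the covariance matrix alone. The following sketch assumes NumPy; the sign convention applied to the eigenvectors is our own choice (an eigenvector is determined only up to sign), made so that the output matches the orientation used in the example.

```python
import numpy as np

S = np.array([[95.29, 52.87],
              [52.87, 54.36]])               # covariance matrix from Example 12.2.1
ybar = np.array([185.7, 151.1])              # mean vector from Example 12.2.1

lam, C = np.linalg.eigh(S)                   # ascending eigenvalues
order = np.argsort(lam)[::-1]
lam, C = lam[order], C[:, order]

# Fix signs: a_11 > 0 for the first eigenvector, a_22 > 0 for the second.
a1 = C[:, 0] * np.sign(C[0, 0])
a2 = C[:, 1] * np.sign(C[1, 1])

slope = a1[1] / a1[0]                        # slope of the major axis, a_12 / a_11
semi_axes = np.sqrt(lam)                     # proportional to the semi-axis lengths

def z2(y):
    """Second principal component of a point y; it is zero on the major axis."""
    return a2 @ (np.asarray(y, dtype=float) - ybar)

print(np.round(lam, 2), np.round(a1, 3), np.round(a2, 3))
print(round(slope, 3), np.round(semi_axes, 1), z2(ybar + 10 * a1))
```

Because the entries of $\mathbf{S}$ are rounded to two decimals, the computed eigenvalues agree with the printed ones only to about two decimal places.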


12.2.2  Algebraic  Approach 

An algebraic approach to principal components can be briefly described as follows. As noted in Section 12.1, we seek a linear combination with maximal variance. By (3.55), the sample variance of $z = \mathbf{a}'\mathbf{y}$ is $\mathbf{a}'\mathbf{S}\mathbf{a}$. Since $\mathbf{a}'\mathbf{S}\mathbf{a}$ has no maximum if $\mathbf{a}$ is unrestricted, we seek the maximum of

\[ \lambda = \frac{\mathbf{a}'\mathbf{S}\mathbf{a}}{\mathbf{a}'\mathbf{a}}. \tag{12.7} \]

By an argument similar to that used in (8.8)–(8.12), the maximum value of $\lambda$ is given by the largest eigenvalue in the expression

\[ (\mathbf{S} - \lambda\mathbf{I})\mathbf{a} = \mathbf{0} \tag{12.8} \]

(see Problem 12.1). The eigenvector $\mathbf{a}_1$ corresponding to the largest eigenvalue $\lambda_1$ is the coefficient vector in $z_1 = \mathbf{a}_1'\mathbf{y}$, the linear combination with maximum variance.

Unlike discriminant analysis or canonical correlation, there is no inverse involved before obtaining eigenvectors for principal components. Therefore, $\mathbf{S}$ can be singular, in which case some of the eigenvalues are zero and can be ignored. A singular $\mathbf{S}$ would arise, for example, when $n < p$, that is, when the sample size is less than the number of variables.

This tolerance of principal component analysis for a singular $\mathbf{S}$ is important in certain research situations. For example, suppose that one has a one-way MANOVA with 10 observations in each of three groups and that $p = 50$, so that there are 50 variables in each of these 30 observation vectors. A MANOVA test involving $\mathbf{E}^{-1}\mathbf{H}$ cannot be carried out directly in this case because $\mathbf{E}$ is singular, but we could reduce the 50 variables to a small number of principal components and then do a MANOVA test on the components. The principal components would be based on $\mathbf{S}$ obtained from the 30 observations ignoring groups. For entry into the MANOVA program, we would evaluate the principal components for each observation vector. If we are retaining $k$ components, we calculate

\[ \begin{aligned} z_{1i} &= \mathbf{a}_1'\mathbf{y}_i \\ z_{2i} &= \mathbf{a}_2'\mathbf{y}_i \\ &\;\;\vdots \\ z_{ki} &= \mathbf{a}_k'\mathbf{y}_i \end{aligned} \tag{12.9} \]

for $i = 1, 2, \ldots, 30$. These are sometimes referred to as component scores. In vector form, (12.9) can be rewritten as

\[ \mathbf{z}_i = \mathbf{A}_k\mathbf{y}_i, \tag{12.10} \]

where

\[ \mathbf{z}_i = \begin{pmatrix} z_{1i} \\ z_{2i} \\ \vdots \\ z_{ki} \end{pmatrix} \quad \text{and} \quad \mathbf{A}_k = \begin{pmatrix} \mathbf{a}_1' \\ \mathbf{a}_2' \\ \vdots \\ \mathbf{a}_k' \end{pmatrix}. \]

We then use $\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_{30}$ as input to the MANOVA program.

Note that in this case with $p > n$, the $k$ components would not likely be stable; that is, they would be different in a new sample. However, this is of no concern here because we are using the components only to extract information from the sample at hand in order to compare the three groups.
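
The component scores in (12.9) and (12.10) can be computed even when $\mathbf{S}$ is singular. The sketch below assumes NumPy and uses simulated data with $n = 30$ and $p = 50$, as in the scenario just described; the choice $k = 5$ is arbitrary and made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 30, 50, 5                           # k = 5 retained components is an arbitrary choice
Y = rng.normal(size=(n, p))                   # simulated stand-in for the 30 observation vectors
Yc = Y - Y.mean(axis=0)                       # centered observations
S = Yc.T @ Yc / (n - 1)                       # 50 x 50, rank at most n - 1, hence singular

lam, C = np.linalg.eigh(S)                    # no inverse of S is needed
order = np.argsort(lam)[::-1]
lam, C = lam[order], C[:, order]

A_k = C[:, :k].T                              # k x p matrix with rows a_1', ..., a_k'
Z = Yc @ A_k.T                                # n x k; row i holds the scores z_i = A_k y_i

print(np.linalg.matrix_rank(S), Z.shape)
print(np.allclose(Z.var(axis=0, ddof=1), lam[:k]))   # score variances equal the eigenvalues
```

The rows of `Z` would then be passed to the MANOVA routine in place of the original 50 variables.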


Example  12.2.2.  Consider  the  football  data  of  Table  8.3.  In  Example  8.8,  we  saw 
that  high  school  football  players  (group  1)  differed  from  the  other  two  groups,  college 
football  players  and  college-age  nonfootball  players.  Therefore,  to  obtain  a  homoge¬ 
neous  group  of  observations,  we  delete  group  1  and  use  groups  2  and  3  combined. 
The  covariance  matrix  is  as  follows: 


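$$
\mathbf{S} = \begin{pmatrix}
.370 & .602 & .149 & .044 & .107 & .209 \\
.602 & 2.629 & .801 & .666 & .103 & .377 \\
.149 & .801 & .458 & .011 & -.013 & .120 \\
.044 & .666 & .011 & 1.474 & .252 & -.054 \\
.107 & .103 & -.013 & .252 & .488 & -.036 \\
.209 & .377 & .120 & -.054 & -.036 & .324
\end{pmatrix}.
$$

The total variance is

$$\sum_{j=1}^{6} s_{jj} = \sum_{i=1}^{6} \lambda_i = 5.743.$$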

The  eigenvalues  of  S  are  as  follows: 

              Proportion     Cumulative
Eigenvalue    of Variance    Proportion

   3.323         .579           .579
   1.374         .239           .818
    .476         .083           .901
    .325         .057           .957
    .157         .027           .985
    .088         .015          1.000

The first two principal components account for 81.8% of the total variance. The
corresponding  eigenvectors  are  as  follows: 


Variable      a1       a2

WDIM         .207    -.142
CIRCUM       .873    -.219
FBEYE        .261    -.231
EYEHD        .326     .891
EARHD        .066     .222
JAW          .128    -.187

Thus the first two principal components are

$$
\begin{aligned}
z_1 &= \mathbf{a}_1'\mathbf{y} = .207y_1 + .873y_2 + .261y_3 + .326y_4 + .066y_5 + .128y_6, \\
z_2 &= \mathbf{a}_2'\mathbf{y} = -.142y_1 - .219y_2 - .231y_3 + .891y_4 + .222y_5 - .187y_6.
\end{aligned}
$$


Notice that the large coefficient in $z_1$ and the large coefficient in $z_2$, .873 and .891,
respectively, correspond to the two largest variances on the diagonal of S. The two
variables with large variances, $y_2$ and $y_4$, have a notable influence on the first two
principal components. However, $z_1$ and $z_2$ are still meaningful linear functions. If the
six variances were closer in size, the six variables would enter more evenly into the
first two principal components. On the other hand, if the variances of $y_2$ and $y_4$ were
substantially larger, $z_1$ and $z_2$ would be essentially equal to $y_2$ and $y_4$, respectively.

Note  that  y2  and  y3  did  not  contribute  at  all  when  this  data  set  was  used  to  separate 
groups  in  Examples  8.5,  8.9,  9.3.1,  and  9.6(a).  However,  these  two  variables  are  very 
useful  here  in  the  first  two  dimensions  showing  the  spread  of  individual  observations. 

□ 
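
As a rough numerical check on Example 12.2.2, the following Python sketch uses only the rounded values printed above for S, the eigenvalues, and the first eigenvector. It recomputes the total variance and the variance proportions, and measures how closely $\mathbf{S}\mathbf{a}_1 = \lambda_1\mathbf{a}_1$ holds. The variable names are illustrative assumptions, and the residual is not exactly zero because every input is rounded to three decimals.

```python
# Check S a_1 = lambda_1 a_1 and the variance proportions for the modified
# football data, using the rounded values printed in Example 12.2.2.

S = [
    [0.370, 0.602, 0.149, 0.044, 0.107, 0.209],
    [0.602, 2.629, 0.801, 0.666, 0.103, 0.377],
    [0.149, 0.801, 0.458, 0.011, -0.013, 0.120],
    [0.044, 0.666, 0.011, 1.474, 0.252, -0.054],
    [0.107, 0.103, -0.013, 0.252, 0.488, -0.036],
    [0.209, 0.377, 0.120, -0.054, -0.036, 0.324],
]
eigenvalues = [3.323, 1.374, 0.476, 0.325, 0.157, 0.088]
a1 = [0.207, 0.873, 0.261, 0.326, 0.066, 0.128]

# Total variance: the trace of S, which should equal the eigenvalue sum.
total_variance = sum(S[j][j] for j in range(6))

# Proportion and cumulative proportion of variance for each component.
proportions = [lam / total_variance for lam in eigenvalues]
cumulative = [sum(proportions[: i + 1]) for i in range(6)]

# Largest element of S a_1 - lambda_1 a_1; small but nonzero because all
# printed values are rounded.
S_a1 = [sum(S[r][c] * a1[c] for c in range(6)) for r in range(6)]
residual = max(abs(S_a1[r] - eigenvalues[0] * a1[r]) for r in range(6))
```

With these inputs, the first two cumulative proportions round to .579 and .818, in agreement with the table of eigenvalues.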


12.3  PRINCIPAL  COMPONENTS  AND 
PERPENDICULAR  REGRESSION 

It was noted in Section 12.2.1 that principal components constitute a rotation of axes.
Another  geometric  property  of  the  line  formed  by  the  first  principal  component  is  that 
it  minimizes  the  total  sum  of  squared  perpendicular  distances  from  the  points  to  the 
line.  This  is  easily  demonstrated  in  the  bivariate  case.  The  first  principal  component 
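line is plotted in Figure 12.2 for the first two variables of the sons data, as in Example
12.2.1. The perpendicular distance from each point to the line is simply $z_2$, the
second coordinate in the transformed coordinates $(z_1, z_2)$. Hence the sum of squares
of perpendicular distances is

$$\sum_{i=1}^{n} z_{2i}^2 = \sum_{i=1}^{n} [\mathbf{a}_2'(\mathbf{y}_i - \bar{\mathbf{y}})]^2, \tag{12.11}$$

Figure 12.2. The first principal component as a perpendicular regression line.

where $\mathbf{a}_2$ is the second eigenvector of S, and we use $\mathbf{y}_i - \bar{\mathbf{y}}$ because the axes have
been translated to the new origin $\bar{\mathbf{y}}$. Since $\mathbf{a}_2'(\mathbf{y}_i - \bar{\mathbf{y}}) = (\mathbf{y}_i - \bar{\mathbf{y}})'\mathbf{a}_2$, we can write
(12.11) in the form

$$
\begin{aligned}
\sum_{i=1}^{n} z_{2i}^2 &= \sum_{i=1}^{n} \mathbf{a}_2'(\mathbf{y}_i - \bar{\mathbf{y}})(\mathbf{y}_i - \bar{\mathbf{y}})'\mathbf{a}_2 \\
&= \mathbf{a}_2'\left[\sum_{i=1}^{n} (\mathbf{y}_i - \bar{\mathbf{y}})(\mathbf{y}_i - \bar{\mathbf{y}})'\right]\mathbf{a}_2 \quad [\text{by (2.44)}] \\
&= (n - 1)\mathbf{a}_2'\mathbf{S}\mathbf{a}_2 = (n - 1)\lambda_2 \quad [\text{by (3.27)}],
\end{aligned}
\tag{12.12}
$$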

which  is  a  minimum  by  a  remark  following  (12.4). 

For the two variables $y_1$ and $y_2$, as plotted in Figure 12.2, the ordinary regression
line of $y_2$ on $y_1$ minimizes the sum of squares of vertical distances from the points
to the line. Similarly, the regression of $y_1$ on $y_2$ minimizes the sum of squares of
horizontal distances from the points to the line. The first principal component line
represents a “perpendicular” regression line that lies between the other two. The
three lines are compared in Figure 12.3 for the partial sons data. The equation of the
first principal component line is easily obtained by setting $z_2 = 0$:

$$
\begin{aligned}
z_2 = \mathbf{a}_2'(\mathbf{y} - \bar{\mathbf{y}}) &= 0, \\
a_{21}(y_1 - \bar{y}_1) + a_{22}(y_2 - \bar{y}_2) &= 0, \\
-.565(y_1 - \bar{y}_1) + .825(y_2 - \bar{y}_2) &= 0.
\end{aligned}
$$


Figure  12.3.  Regression  lines  compared  with  first  principal  component  line. 
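
The perpendicular-distance property in (12.12) can be checked numerically. The sketch below is a minimal illustration on a small hypothetical bivariate sample (not the sons data). It uses the closed-form eigenvalues of a 2 x 2 symmetric matrix, and the helper `perpendicular_ss` is an assumed name. It confirms that the sum of squared perpendicular distances to the first principal component line equals $(n - 1)\lambda_2$ and does not exceed the corresponding sum for the ordinary regression line of $y_2$ on $y_1$.

```python
import math

# Hypothetical bivariate sample (not the sons data from the text).
y1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y2 = [2.0, 1.0, 4.0, 3.0, 6.0, 8.0]
n = len(y1)

m1, m2 = sum(y1) / n, sum(y2) / n
d1 = [v - m1 for v in y1]
d2 = [v - m2 for v in y2]

# Sample covariance matrix S = [[s11, s12], [s12, s22]].
s11 = sum(a * a for a in d1) / (n - 1)
s22 = sum(b * b for b in d2) / (n - 1)
s12 = sum(a * b for a, b in zip(d1, d2)) / (n - 1)

# Smaller eigenvalue of S and its unit eigenvector a_2 (closed form for a
# 2 x 2 matrix; requires s12 != 0, which holds for this sample).
trace, det = s11 + s22, s11 * s22 - s12 ** 2
lam2 = (trace - math.sqrt(trace ** 2 - 4 * det)) / 2
norm = math.hypot(s12, lam2 - s11)
a2 = (s12 / norm, (lam2 - s11) / norm)


def perpendicular_ss(normal):
    """Sum of squared perpendicular distances to the line through the mean
    vector whose unit normal is `normal`."""
    return sum((normal[0] * a + normal[1] * b) ** 2 for a, b in zip(d1, d2))


# First principal component line: its normal is a_2, so distances are z_2.
ss_pc = perpendicular_ss(a2)

# Ordinary regression of y2 on y1: slope b, unit normal proportional to (-b, 1).
b = s12 / s11
scale = math.hypot(b, 1.0)
ss_ols = perpendicular_ss((-b / scale, 1.0 / scale))
```

Any other line through $\bar{\mathbf{y}}$ has unit normal $\mathbf{n}$ with $\mathbf{n}'\mathbf{S}\mathbf{n} \geq \lambda_2$, which is why `ss_pc` cannot exceed `ss_ols`.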


12.4  PLOTTING  OF  PRINCIPAL  COMPONENTS 

The  plots  in  Figures  12.1  and  12.2  were  illustrations  of  principal  components  as  a 
rotation  of  axes  when  p  =  2.  When  p  >  2,  we  can  plot  the  first  two  components  as  a 
dimension reduction device. We simply evaluate the first two components $(z_1, z_2)$ for
each observation vector and plot these n points. The plot is equivalent to a projection
of the p-dimensional data swarm onto the plane that shows the greatest spread of the
points. 

The  plot  of  the  first  two  components  may  reveal  some  important  features  of  the 
data set. In Example 12.4(a), we show a principal component plot that exhibits a pattern
typical of a sample from a multivariate normal distribution. One of the objectives
of plotting is to check for departures from normality, such as outliers or nonlinearity.
In Examples 12.4(b) and 12.4(c), we illustrate principal component plots showing a
nonnormal pattern characterized by the presence of outliers. Jackson (1980) provided
a test for adequacy of representation of observation vectors in terms of principal
components.

Gnanadesikan  (1997,  p.  308)  pointed  out  that,  in  general,  the  first  few  principal 
components  are  sensitive  to  outliers  that  inflate  variances  or  distort  covariances,  and 
the last few are sensitive to outliers that introduce artificial dimensions or mask
singularities. We could examine the bivariate plots of at least the first two and the last
two  principal  components  in  a  search  for  outliers  that  may  exert  undue  influence. 

Devlin,  Gnanadesikan,  and  Kettenring  (1981)  recommended  the  extraction  of 
principal  components  from  robust  estimates  of  S  or  R  that  reduce  the  influence  of 
outliers.  Campbell  (1980)  and  Ruymgaart  (1981)  discussed  direct  robust  estimation 
of principal components. Critchley (1985) developed methods for detection of
influential observations in principal component analysis.

Figure 12.4. Plot of first two components for the modified football data.


Another  feature  of  the  data  that  a  plot  of  the  first  two  components  may  reveal  is 
a  tendency  of  the  points  to  cluster.  The  plot  may  reveal  groupings  of  points;  this  is 
illustrated  in  Example  12.4(d). 

Example  12.4(a).  For  the  modified  football  data  in  Example  12.2.2,  the  first  two 
principal components were given as follows:

$$
\begin{aligned}
z_1 &= \mathbf{a}_1'\mathbf{y} = .207y_1 + .873y_2 + .261y_3 + .326y_4 + .066y_5 + .128y_6, \\
z_2 &= \mathbf{a}_2'\mathbf{y} = -.142y_1 - .219y_2 - .231y_3 + .891y_4 + .222y_5 - .187y_6.
\end{aligned}
$$

These are evaluated for each observation vector and plotted in Figure 12.4. (For
convenience in scaling, $\mathbf{y} - \bar{\mathbf{y}}$ was used in the computations.) The pattern is typical
of that from a multivariate normal distribution. Note that the variance along the $z_1$
axis is greater than the variance in the $z_2$ direction, as expected. □

Example 12.4(b). In Figures 4.9 and 4.10, the Q-Q plot and bivariate scatter plots
for the ramus bone data of Table 3.6 exhibit a nonnormal pattern. A principal component
analysis using the covariance matrix is given in Table 12.1, and the first two


Table 12.1. Principal Components for the Ramus Bone Data of Table 3.6

   Eigenvalues           First Two Eigenvectors
Number    Value       Variable      a1       a2

  1       25.05       AGE 8        .474     .592
  2        1.74       AGE 8.5      .492     .406
  3         .22       AGE 9        .515    -.304
  4         .11       AGE 9.5      .517    -.627

Figure 12.5. First two principal components for the ramus bone data in Table 3.6.


principal  components  are  plotted  in  Figure  12.5.  The  presence  of  three  outliers  that 
cause  a  nonnormal  pattern  is  evident.  These  outliers  do  not  appear  when  the  four 
variables  are  examined  individually.  □ 

Example  12.4(c).  A  rather  extreme  example  of  the  effect  of  an  outlier  is  given  by 
Devlin, Gnanadesikan, and Kettenring (1981). The data set involved $p = 14$ economic
variables for $n = 29$ chemical companies. The first two principal components
are plotted in Figure 12.6. The sample correlation $r_{z_1 z_2}$ is indeed zero for all
29 points, as it must be [see (12.6)], but if the apparent outlier is excluded from
the computation, then $r_{z_1 z_2} = .99$ for the remaining 28 points. If the outlier were
deleted  from  the  data  set,  the  axes  of  the  principal  components  would  pass  through 
the  natural  extensions  of  the  data  swarm.  □ 

Example  12.4(d).  Jeffers  (1967)  applied  principal  component  analysis  to  a  sample 
of  40  alate  adelges  (winged  aphids)  on  which  the  following  19  variables  had  been 
measured: 


LENGTH      body length
WIDTH       body width
FORWING     forewing length
HINWING     hind-wing length
SPIRAC      number of spiracles
ANTSEG 1    length of antennal segment I
ANTSEG 2    length of antennal segment II
ANTSEG 3    length of antennal segment III
ANTSEG 4    length of antennal segment IV
ANTSEG 5    length of antennal segment V
ANTSPIN     number of antennal spines
TARSUS 3    leg length, tarsus III
TIBIA 3     leg length, tibia III
FEMUR 3     leg length, femur III
ROSTRUM     rostrum
OVIPOS      ovipositor
OVSPIN      number of ovipositor spines
FOLD        anal fold
HOOKS       number of hind-wing hooks

Figure 12.6. First two principal components for economics data. (Axis label: 1st principal component.)

An  objective  in  the  study  was  to  determine  the  number  of  distinct  taxa  present 
in  the  habitat  where  the  sample  was  taken.  Since  adelges  are  difficult  to  identify  by 
the  usual  taxonomic  methods,  principal  component  analysis  was  used  to  search  for 
groupings  among  the  40  individuals  in  the  sample. 

The  correlation  matrix  is  given  in  Table  12.2,  and  the  eigenvalues  and  first  four 
eigenvectors  are  in  Tables  12.3  and  12.4,  respectively.  The  eigenvectors  are  scaled 
so  that  the  largest  value  in  each  is  1.  The  first  principal  component  is  largely  an 
index of size. The second component is associated with SPIRAC, OVIPOS, OVSPIN,
and  FOLD. 

The  first  two  components  were  computed  for  each  of  the  40  individuals  and 
plotted  in  Figure  12.7.  Since  the  first  two  components  account  for  85%  of  the  total 
variance, the plot represents the data with very little distortion. There are four major
groups, apparently corresponding to species. The groupings form an interesting
S-shape. □

Table 12.2. Correlation Matrix for Winged Aphid Variables (Lower Triangle)

         y1     y2     y3     y4     y5     y6     y7     y8     y9    y10
y1    1.000
y2     .934  1.000
y3     .927   .941  1.000
y4     .909   .944   .933  1.000
y5     .524   .487   .543   .499  1.000
y6     .799   .821   .856   .833   .703  1.000
y7     .854   .865   .886   .889   .719   .923  1.000
y8     .789   .834   .846   .885   .253   .699   .751  1.000
y9     .835   .863   .862   .850   .462   .752   .793   .745  1.000
y10    .845   .878   .863   .881   .567   .836   .913   .787   .805  1.000
y11   -.458  -.496  -.522  -.488  -.174  -.317  -.383  -.497  -.356  -.371
y12    .917   .942   .940   .945   .516   .846   .907   .861   .848   .902
y13    .939   .961   .956   .952   .494   .849   .914   .876   .877   .901
y14    .953   .954   .946   .949   .452   .823   .886   .878   .883   .891
y15    .895   .899   .882   .908   .551   .831   .891   .794   .818   .848
y16    .691   .652   .694   .623   .815   .812   .855   .410   .620   .712
y17    .327   .305   .356   .272   .746   .553   .567   .067   .300   .384
y18    .676  -.712  -.667  -.736  -.233  -.504  -.502  -.758  -.666  -.629
y19    .702   .729   .746   .777   .285   .499   .592   .793   .671   .668

        y11    y12    y13    y14    y15    y16    y17    y18    y19
y11   1.000
y12   -.465  1.000
y13   -.447   .981  1.000
y14   -.439   .971   .991  1.000
y15   -.405   .908   .920   .921  1.000
y16   -.198   .725   .714   .676   .720  1.000
y17   -.032   .396   .360   .298   .378   .781  1.000
y18    .492  -.657  -.655  -.678  -.633  -.186   .169  1.000
y19   -.425   .696   .724   .731   .694   .287   .026  -.775  1.000

12.5  PRINCIPAL  COMPONENTS  FROM  THE  CORRELATION  MATRIX 

Generally,  extracting  components  from  S  rather  than  R  remains  closer  to  the  spirit 
and  intent  of  principal  component  analysis,  especially  if  the  components  are  to  be 
used  in  further  computations.  However,  in  some  cases,  the  principal  components  will 
be  more  interpretable  if  R  is  used.  For  example,  if  the  variances  differ  widely  or  if  the 
measurement  units  are  not  commensurate,  the  components  of  S  will  be  dominated  by 
the  variables  with  large  variances.  The  other  variables  will  contribute  very  little.  For 
a  more  balanced  representation  in  such  cases,  components  of  R  may  be  used  (see, 
for example, Problem 12.9).

Table 12.3. Eigenvalues of the Correlation Matrix of the Winged Aphid Data

Component    Eigenvalue    Percent of Variance    Cumulative Percent

    1          13.861             73.0                   73.0
    2           2.370             12.5                   85.4
    3            .748              3.9                   89.4
    4            .502              2.6                   92.0
    5            .278              1.4                   93.5
    6            .266              1.4                   94.9
    7            .193              1.0                   95.9
    8            .157               .8                   96.7
    9            .140               .7                   97.4
   10            .123               .6                   98.1
   11            .092               .4                   98.6
   12            .074               .4                   99.0
   13            .060               .3                   99.3
   14            .042               .2                   99.5
   15            .036               .2                   99.7
   16            .024               .1                   99.8
   17            .020               .1                   99.9
   18            .011               .1                  100.0
   19            .003               .0                  100.0

 Total         19.000

Table 12.4. Eigenvectors for the First Four Components of the Winged Aphid Data

                      Eigenvectors
Variable        1       2       3       4

LENGTH         .96    -.06     .03    -.12
WIDTH          .98    -.12     .01    -.16
FORWING        .99    -.06    -.06    -.11
HINWING        .98    -.16     .03    -.00
SPIRAC         .61     .74    -.20    1.00
ANTSEG 1       .91     .33     .04     .02
ANTSEG 2       .96     .30     .00    -.04
ANTSEG 3       .88    -.43     .06    -.18
ANTSEG 4       .90    -.08     .18    -.01
ANTSEG 5       .94     .05     .11     .03
ANTSPIN       -.49     .37    1.00     .27
TARSUS 3       .99    -.02     .03    -.29
TIBIA 3       1.00    -.05     .09    -.31
FEMUR 3        .99    -.12     .12    -.31
ROSTRUM        .96     .02     .08    -.06
OVIPOS         .76     .73    -.03    -.09
OVSPIN         .41    1.00    -.16    -.06
FOLD          -.71     .64     .04    -.80
HOOKS          .76    -.52     .06     .72

As with any change of scale, when the variables are standardized in transforming
from  S  to  R,  the  shape  of  the  swarm  of  points  will  change.  Note,  however,  that 
after  transforming  to  R,  any  further  changes  of  scale  on  the  variables  would  not 
affect  the  components  because  changes  of  scale  do  not  change  R.  Thus  the  principal 
components  from  R  are  scale  invariant. 

To  illustrate  how  the  eigenvalues  and  eigenvectors  change  when  converting  from 
S  to  R,  we  use  a  simple  bivariate  example  in  which  one  variance  is  substantially 
larger than the other. Suppose that S and the corresponding R have the values

$$
\mathbf{S} = \begin{pmatrix} 1 & 4 \\ 4 & 25 \end{pmatrix}
\quad\text{and}\quad
\mathbf{R} = \begin{pmatrix} 1 & .8 \\ .8 & 1 \end{pmatrix}.
$$

The eigenvalues and eigenvectors from S are

$$
\begin{aligned}
\lambda_1 &= 25.65, & \mathbf{a}_1' &= (.160, .987), \\
\lambda_2 &= .35, & \mathbf{a}_2' &= (.987, -.160).
\end{aligned}
$$

The patterns we see in $\lambda_1$, $\lambda_2$, $\mathbf{a}_1$, and $\mathbf{a}_2$ are quite predictable. The symmetry in
$\mathbf{a}_1$ and $\mathbf{a}_2$ is due to their orthogonality, $\mathbf{a}_1'\mathbf{a}_2 = 0$. The large variance of $y_2$ in S is
reflected in the first principal component $z_1 = .160y_1 + .987y_2$, where $y_2$ is weighted
heavily. Thus the first principal component $z_1$ essentially duplicates $y_2$ and does not
show the mutual effect of $y_1$ and $y_2$. As expected, $z_1$ accounts for virtually all of the
total variance:

$$\frac{\lambda_1}{\lambda_1 + \lambda_2} = \frac{25.65}{26} = .9865.$$


The  eigenvalues  and  eigenvectors  of  R  are 


$$
\begin{aligned}
\lambda_1 &= 1.8, & \mathbf{a}_1' &= (.707, .707), \\
\lambda_2 &= .2, & \mathbf{a}_2' &= (.707, -.707).
\end{aligned}
$$

The first principal component of R,

$$z_1 = .707\,\frac{y_1 - \bar{y}_1}{1} + .707\,\frac{y_2 - \bar{y}_2}{5},$$

accounts for a high proportion of variance,

$$\frac{\lambda_1}{\lambda_1 + \lambda_2} = \frac{1.8}{2} = .9,$$

because the variables are fairly highly correlated ($r = .8$). But the standardized
variables $(y_1 - \bar{y}_1)/1$ and $(y_2 - \bar{y}_2)/5$ are equally weighted in $z_1$, due to the equality
of the diagonal elements (“variances”) of R.
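
The bivariate comparison of S and R can be reproduced with a short Python sketch. It assumes standard deviations 1 and 5 and correlation .8, as implied by the standardized variables and the value of r quoted above, so that the covariance is $.8 \times 1 \times 5 = 4$. The helper `eigen_2x2` is an illustrative name and uses the closed-form solution for a 2 x 2 symmetric matrix rather than a general eigenvalue routine.

```python
import math


def eigen_2x2(m11, m12, m22):
    """Eigenvalues and unit eigenvectors of the symmetric matrix
    [[m11, m12], [m12, m22]], largest eigenvalue first (requires m12 != 0)."""
    trace, det = m11 + m22, m11 * m22 - m12 ** 2
    root = math.sqrt(trace ** 2 - 4 * det)
    result = []
    for lam in ((trace + root) / 2, (trace - root) / 2):
        v = (m12, lam - m11)            # solves (M - lam I) v = 0
        norm = math.hypot(v[0], v[1])
        result.append((lam, (v[0] / norm, v[1] / norm)))
    return result


# Assumed inputs: s1 = 1, s2 = 5, r = .8, so that s12 = r * s1 * s2 = 4.
s1, s2, r = 1.0, 5.0, 0.8
from_S = eigen_2x2(s1 ** 2, r * s1 * s2, s2 ** 2)
from_R = eigen_2x2(1.0, r, 1.0)

for label, pairs in (("S", from_S), ("R", from_R)):
    total = sum(lam for lam, _ in pairs)
    for lam, a in pairs:
        print(f"{label}: lambda = {lam:.2f}, a' = ({a[0]:.3f}, {a[1]:.3f}), "
              f"proportion = {lam / total:.4f}")
```

The printed eigenvectors from S are dominated by $y_2$, whereas those from R weight the two standardized variables equally.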

We  now  list  some  general  comparisons  of  principal  components  from  R  with  those 
from  S: 


1.  The  percent  of  variance  in  (12.5)  accounted  for  by  the  components  of  R  will 
differ  from  the  percent  for  S,  as  illustrated  above. 

2.  The  coefficients  of  the  principal  components  from  R  differ  from  those  obtained 
from  S,  as  illustrated  above. 

3.  If  we  express  the  components  from  R  in  terms  of  the  original  variables,  they 
still will not agree with the components from S. By transforming the standardized
variables back to the original variables in the above illustration, the
components of R become

$$
\begin{aligned}
z_1 &= .707\,\frac{y_1 - \bar{y}_1}{1} + .707\,\frac{y_2 - \bar{y}_2}{5} = .707y_1 + .141y_2 + \text{const}, \\
z_2 &= .707\,\frac{y_1 - \bar{y}_1}{1} - .707\,\frac{y_2 - \bar{y}_2}{5} = .707y_1 - .141y_2 + \text{const}.
\end{aligned}
$$


As  expected,  these  are  very  different  from  the  components  extracted  directly 
from  S.  This  problem  arises,  of  course,  because  of  the  lack  of  scale  invariance 
of the components of S.

4. The principal components from R are scale invariant, because R itself is scale
invariant.

5. The components from a given matrix R are not unique to that R. For example,
in the bivariate case, the eigenvalues of

$$\mathbf{R} = \begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix}$$

are given by

$$\lambda_1 = 1 + r, \qquad \lambda_2 = 1 - r, \tag{12.13}$$

and the eigenvectors are $\mathbf{a}_1' = (.707, .707)$ and $\mathbf{a}_2' = (.707, -.707)$, which
give principal components

$$
\begin{aligned}
z_1 &= .707\,\frac{y_1 - \bar{y}_1}{s_1} + .707\,\frac{y_2 - \bar{y}_2}{s_2}, \\
z_2 &= .707\,\frac{y_1 - \bar{y}_1}{s_1} - .707\,\frac{y_2 - \bar{y}_2}{s_2}.
\end{aligned}
\tag{12.14}
$$


The  components  in  (12.14)  do  not  depend  on  r.  For  example,  they  serve 
equally well for $r = .01$ and for $r = .99$. For $r = .01$, the proportion of
variance explained by $z_1$ is $\lambda_1/(\lambda_1 + \lambda_2) = (1 + .01)/(1 + .01 + 1 - .01) =
1.01/2 = .505$. For $r = .99$, the ratio is $1.99/2 = .995$. Thus the statement
that  the  first  component  from  a  correlation  matrix  accounts  for,  say,  90%  of 
the  variance  is  not  very  meaningful.  In  general,  for  p  >  2,  the  components 
from  R  depend  only  on  the  ratios  (relative  values)  of  the  correlations,  not  on 
their  actual  values,  and  components  of  a  given  R  matrix  will  serve  for  other  R 
matrices  [see  Rencher  (1998,  Section  9.4)]. 
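
The point of item 5 can be illustrated with a small Python sketch for the bivariate case. It assumes a positive correlation r and uses (12.13) directly; the function name `first_component` and the chosen values of r are illustrative only. The first eigenvector is the same for every r, while the proportion of variance it explains changes with r.

```python
import math


def first_component(r):
    """First eigenvector of R = [[1, r], [r, 1]] and the proportion of
    variance it explains, for 0 < r < 1."""
    lam1 = 1 + r                           # by (12.13)
    v = (r, lam1 - 1)                      # solves (R - lam1 I) v = 0
    norm = math.hypot(v[0], v[1])
    a1 = (v[0] / norm, v[1] / norm)
    proportion = lam1 / 2                  # lam1 / (lam1 + lam2); the sum is 2
    return a1, proportion


# Same eigenvector (.707, .707) for each r; only the proportion differs.
results = {r: first_component(r) for r in (0.01, 0.5, 0.99)}
```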


12.6  DECIDING  HOW  MANY  COMPONENTS  TO  RETAIN 

In  every  application,  a  decision  must  be  made  on  how  many  principal  components 
should be retained in order to effectively summarize the data. The following
guidelines have been proposed:

1.  Retain  sufficient  components  to  account  for  a  specified  percentage  of  the  total 
variance,  say,  80%. 

2. Retain the components whose eigenvalues are greater than the average of the
eigenvalues, $\sum_{i=1}^{p} \lambda_i / p$. For a correlation matrix, this average is 1.

3. Use the scree graph, a plot of $\lambda_i$ versus $i$, and look for a natural break between
the “large” eigenvalues and the “small” eigenvalues.

4. Test the significance of the “larger” components, that is, the components
corresponding to the larger eigenvalues.

We now discuss these four criteria for choosing the components to keep. Note,
however,  that  the  smallest  components  may  carry  valuable  information  that  should 
not  be  routinely  ignored  (see  Section  12.7). 

In method 1, the challenge lies in selecting an appropriate threshold percentage.
If  we  aim  too  high,  we  run  the  risk  of  including  components  that  are  either  sample 
specific  or  variable  specific.  By  sample  specific  we  mean  that  a  component  may 
not  generalize  to  the  population  or  to  other  samples.  A  variable-specific  component 
is  dominated  by  a  single  variable  and  does  not  represent  a  composite  summary  of 
several  variables. 

Method  2  is  widely  used  and  is  the  default  in  many  software  packages.  By 
(2.107), $\sum_{i=1}^{p} \lambda_i = \operatorname{tr}(\mathbf{S})$, and the average eigenvalue is also the average variance of
the  individual  variables.  Thus  method  2  retains  those  components  that  account  for 
more  variance  than  the  average  variance  of  the  variables.  In  cases  where  the  data 
can  be  successfully  summarized  in  a  relatively  small  number  of  dimensions,  there  is 
often  a  wide  gap  between  the  two  eigenvalues  that  fall  on  both  sides  of  the  average. 
In  Example  12.2.2,  the  average  eigenvalue  (of  S)  for  the  football  data  is  .957,  which 
is amply bracketed by $\lambda_2 = 1.37$ and $\lambda_3 = .48$. In the winged aphid data in
Example 12.4(d), the second and third eigenvalues (of R) are 2.370 and .748, leaving a
comfortable  margin  on  both  sides  of  1.  In  some  cases,  one  may  wish  to  move  the 
cutoff  point  slightly  to  accommodate  a  visible  gap  in  eigenvalues. 

The  scree  graph  in  method  3  is  named  for  its  similarity  in  appearance  to  a  cliff 
with  rocky  debris  at  its  bottom.  The  scree  graph  for  the  modified  football  data  of 
Example  12.2.2  exhibits  an  ideal  pattern,  as  shown  in  Figure  12.8.  The  first  two 
eigenvalues  form  a  steep  curve  followed  by  a  bend  and  then  a  straight-line  trend  with 
shallow  slope.  The  recommendation  is  to  retain  those  eigenvalues  in  the  steep  curve 
before  the  first  one  on  the  straight  line.  Thus  in  Figure  12.8,  two  components  would 
be  retained.  In  practice,  the  turning  point  between  the  steep  curve  and  the  straight 
line  may  not  be  as  distinct  as  this  or  there  may  be  more  than  one  discernible  bend.  In 
such cases, this approach is not as conclusive. The scree graph for the winged aphid
data in Example 12.4(d) is plotted in Figure 12.9. The plot would suggest that two
components be retained (possibly four).

Figure 12.8. Scree graph for eigenvalues of modified football data. (Horizontal axis: eigenvalue number.)

Figure 12.9. Scree graph for eigenvalues of winged aphid data. (Horizontal axis: eigenvalue number.)

The  remainder  of  this  section  is  devoted  to  method  4,  tests  of  significance.  The 
tests  assume  multivariate  normality,  which  is  not  required  for  estimation  of  principal 
components. 

It  may  be  useful  to  make  a  preliminary  test  of  complete  independence  of  the 
variables, as in Section 7.4.3: $H_0\colon \boldsymbol{\Sigma} = \operatorname{diag}(\sigma_{11}, \sigma_{22}, \ldots, \sigma_{pp})$, or equivalently,
$H_0\colon \mathbf{P}_\rho = \mathbf{I}$. The test statistic is given in (7.37) and (7.38). If the results indicate that
the  variables  are  independent,  there  is  no  point  in  extracting  principal  components, 
since  (except  for  sampling  fluctuation)  the  variables  themselves  already  form  the 
principal  components. 

To test the significance of the “larger” components, we test the hypothesis that the
last k population eigenvalues are small and equal, $H_{0k}\colon \gamma_{p-k+1} = \gamma_{p-k+2} = \cdots = \gamma_p$,
where $\gamma_1, \gamma_2, \ldots, \gamma_p$ denote the population eigenvalues, namely, the eigenvalues
of $\boldsymbol{\Sigma}$. The implication is that the first sample components capture all the essential
dimensions, whereas the last components reflect noise. If $H_{0k}$ is true, the last k sample
eigenvalues will tend to have the pattern shown by the straight line with small slope
in the ideal scree graph, such as in Figure 12.8 or 12.9.

To test $H_{0k}\colon \gamma_{p-k+1} = \cdots = \gamma_p$ using a likelihood ratio approach, we calculate
the average of the last k eigenvalues of S,

$$\bar{\lambda} = \sum_{i=p-k+1}^{p} \frac{\lambda_i}{k},$$

and use the test statistic

$$u = \left(n - \frac{2p + 11}{6}\right)\left(k \ln \bar{\lambda} - \sum_{i=p-k+1}^{p} \ln \lambda_i\right), \tag{12.15}$$

which has an approximate $\chi^2$-distribution. We reject $H_{0k}$ if $u \geq \chi^2_{\alpha,\nu}$, where
$\nu = \tfrac{1}{2}(k - 1)(k + 2)$.

To carry out this procedure, we could begin by testing $H_{02}\colon \gamma_{p-1} = \gamma_p$. If this is
accepted, we could then test $H_{03}\colon \gamma_{p-2} = \gamma_{p-1} = \gamma_p$ and continue testing in this
fashion until $H_{0k}$ is rejected for some value of k.

In  practice,  when  the  variables  are  fairly  highly  correlated  and  the  data  can  be 
successfully  represented  by  a  small  number  of  principal  components,  the  first  three 
methods  will  typically  agree  on  the  number  of  components  to  retain,  and  the  test  in 
method  4  will  often  indicate  a  larger  number  of  components. 

Example  12.6.  We  apply  the  preceding  four  criteria  to  the  modified  football  data  of 
Example  12.2.2. 

For method 1, we simply examine the eigenvalues and their proportion of variance
explained, as obtained in Example 12.2.2:

              Proportion     Cumulative
Eigenvalue    of Variance    Proportion

   3.323         .579           .579
   1.374         .239           .818
    .476         .083           .901
    .325         .057           .957
    .157         .027           .985
    .088         .015          1.000

To account for 82% of the variance, we would keep two components. This percent
of variance is high enough for most descriptive purposes. For certain other
applications, such as input to another analysis, we might wish to retain three components,
which  would  account  for  90%  of  the  variance. 

To  apply  method  2,  we  find  the  average  eigenvalue  to  be 


$$\bar{\lambda} = \sum_{i=1}^{6} \frac{\lambda_i}{6} = \frac{5.742824}{6} = .957.$$

Since only $\lambda_1$ and $\lambda_2$ exceed .957, we would retain two components.

For method 3, the scree graph in Figure 12.8 indicates conclusively that two
components should be retained.

To implement method 4, we carry out the significance tests in (12.15). The values
of the test statistic u for k = 2, 3, ..., 6 are as follows:

Eigenvalue     k       u       df    $\chi^2_{.05}$

  3.32341      6    245.57     20     31.41
  1.37431      5    123.93     14     23.68
   .47607      4     44.10      9     16.92
   .32468      3     23.84      5     11.07
   .15650      2      4.62      2      5.99
   .08785      1

The tests indicate that only the last two (population) eigenvalues are equal, and
we  should  retain  the  first  four.  This  differs  from  the  results  of  the  other  three  criteria, 
which  are  in  close  agreement  that  two  components  should  be  retained.  □ 
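
The calculations in Example 12.6 can be retraced with the following Python sketch. It rests on two assumptions that should be checked against the source: the test statistic is taken to be $u = (n - (2p + 11)/6)(k \ln \bar{\lambda} - \sum \ln \lambda_i)$ over the last k eigenvalues, and the sample size is taken to be n = 60 (30 observations in each of the two retained groups). The critical values are copied from the table above rather than computed, and all names are illustrative.

```python
import math

# Eigenvalues of S for the modified football data (Example 12.6).
eigenvalues = [3.32341, 1.37431, 0.47607, 0.32468, 0.15650, 0.08785]
p = len(eigenvalues)
n = 60  # assumption: 30 observations in each of the two retained groups

# Upper .05 chi-square critical values copied from the table in Example 12.6,
# keyed by k; the degrees of freedom are (k - 1)(k + 2) / 2.
critical_05 = {2: 5.99, 3: 11.07, 4: 16.92, 5: 23.68, 6: 31.41}


def u_statistic(k):
    """Statistic for H0k: the last k population eigenvalues are equal."""
    last_k = eigenvalues[p - k:]
    mean = sum(last_k) / k
    log_sum = sum(math.log(lam) for lam in last_k)
    return (n - (2 * p + 11) / 6) * (k * math.log(mean) - log_sum)


# Method 1: smallest number of components reaching 80% of total variance.
total = sum(eigenvalues)
running, by_percentage = 0.0, 0
for lam in eigenvalues:
    running += lam
    by_percentage += 1
    if running / total >= 0.80:
        break

# Method 2: components whose eigenvalue exceeds the average eigenvalue.
by_average = sum(lam > total / p for lam in eigenvalues)

# Method 4: test H02, H03, ... in turn until one is rejected.
equal_count = 1          # if no hypothesis is accepted, nothing is pooled
for k in range(2, p + 1):
    if u_statistic(k) >= critical_05[k]:
        break
    equal_count = k
by_test = p - equal_count
```

Under these assumptions, methods 1 and 2 each retain two components and the sequential test retains four, matching the conclusions of the example.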


12.7  INFORMATION  IN  THE  LAST  FEW  PRINCIPAL  COMPONENTS 

Up  to  this  point,  we  have  focused  on  using  the  first  few  principal  components  to 
summarize  and  simplify  the  data.  However,  the  last  few  components  may  carry  useful 
information  in  some  applications. 

Since  the  eigenvalues  serve  as  variances  of  the  principal  components,  the  last  few 
principal  components  have  smaller  variances.  If  the  variance  of  a  component  is  zero 
or  close  to  zero,  the  component  represents  a  linear  relationship  among  the  variables 
that is essentially constant; that is, the relationship holds for all $\mathbf{y}_i$'s in the sample.
Thus if the last eigenvalue is near zero, it signifies the presence of a collinearity that
may provide new information for the researcher. Suppose, for example, that there are
five variables and $y_5 = \sum_{j=1}^{4} y_j/4$. Then S is singular, and barring round-off error,
$\lambda_5$ will be zero. Thus $s^2_{z_5} = 0$, and $z_5$ is constant. As noted early in Section 12.2, the
$\mathbf{y}_i$'s are centered, because the origin of the principal components is translated to $\bar{\mathbf{y}}$.
Hence the constant value of $z_5$ is its mean, which is zero:

$$z_5 = \mathbf{a}_5'\mathbf{y} = a_{51}y_1 + a_{52}y_2 + \cdots + a_{55}y_5 = 0.$$

Since this must reflect the dependency of $y_5$ on $y_1$, $y_2$, $y_3$, and $y_4$, the eigenvector $\mathbf{a}_5'$
will be proportional to $(1, 1, 1, 1, -4)$.
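
A small numerical illustration of this collinearity, using hypothetical data, is sketched below in Python. The fifth variable is constructed as the average of the first four, and the code checks that the unit vector proportional to $(1, 1, 1, 1, -4)$ satisfies $\mathbf{S}\mathbf{a}_5 = \mathbf{0}$, so that the corresponding component is zero for every centered observation. The data values and variable names are assumptions made for illustration.

```python
# Hypothetical data: four variables plus y5 defined as their average.
rows = [
    [2.0, 7.0, 1.0, 4.0],
    [5.0, 3.0, 6.0, 2.0],
    [1.0, 8.0, 4.0, 9.0],
    [6.0, 2.0, 3.0, 5.0],
    [3.0, 5.0, 9.0, 1.0],
    [8.0, 4.0, 2.0, 7.0],
]
data = [row + [sum(row) / 4] for row in rows]
n, p = len(data), 5

means = [sum(obs[j] for obs in data) / n for j in range(p)]
centered = [[obs[j] - means[j] for j in range(p)] for obs in data]

# Sample covariance matrix S (singular by construction).
S = [[sum(c[j] * c[l] for c in centered) / (n - 1) for l in range(p)]
     for j in range(p)]

# Candidate eigenvector proportional to (1, 1, 1, 1, -4), scaled to unit length.
norm = 20 ** 0.5
a5 = [1 / norm] * 4 + [-4 / norm]

# S a5 should vanish (eigenvalue zero), and z5 = a5'(y - ybar) should be zero
# for every observation, up to round-off.
S_a5 = [sum(S[j][l] * a5[l] for l in range(p)) for j in range(p)]
z5 = [sum(a * c for a, c in zip(a5, obs)) for obs in centered]
```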


12.8  INTERPRETATION  OF  PRINCIPAL  COMPONENTS 

In Section 12.5, we noted that principal components obtained from R are not
compatible with those obtained from S. Because of this lack of scale invariance of principal
components from S, the coefficients cannot be converted to standardized form, as
can be done with coefficients in discriminant functions in Chapter 8 and canonical
variates in Chapter 11. Hence interpretation of principal components is not as
clear-cut  as  with  previous  linear  functions  that  we  have  discussed.  We  must  choose 
between  components  of  S  or  R,  knowing  they  will  have  a  different  interpretation.  If 
the  variables  have  widely  disparate  variances,  we  can  use  R  instead  of  S  to  improve 
interpretation. 

For certain patterns of elements in S or R, the form of the principal components
can  be  predicted.  This  aid  to  interpretation  is  discussed  in  Section  12.8.1.  As  with 
discriminant  functions  and  canonical  variates,  some  writers  have  advocated  rotation 
and  the  use  of  correlations  between  the  variables  and  the  principal  components.  We 
argue  against  the  use  of  these  two  approaches  to  interpretation  in  Sections  12.8.2  and 
12.8.3.

12.8.1 Special Patterns in S or R

In  the  covariance  or  correlation  matrix,  we  may  recognize  a  distinguishing  pattern 
from  which  the  structure  of  the  principal  components  can  be  deduced.  For  example, 
we  noted  in  Section  12.2  that  if  one  variable  has  a  much  larger  variance  than  the 
other  variables,  this  variable  will  dominate  the  first  component,  which  will  account 
for  most  of  the  variance.  Another  case  in  which  a  component  will  duplicate  a  variable 
occurs when the variable is uncorrelated with the other variables. We now
demonstrate this by showing that if all p variables are uncorrelated, the variables
themselves are the principal components. If the variables were uncorrelated (orthogonal),
S would have the form

$$
\mathbf{S} = \begin{pmatrix}
s_{11} & 0 & \cdots & 0 \\
0 & s_{22} & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & s_{pp}
\end{pmatrix},
$$

and the characteristic equation would be

$$0 = |\mathbf{S} - \lambda\mathbf{I}| = \prod_{i=1}^{p} (s_{ii} - \lambda) \quad [\text{by (2.83)}], \tag{12.16}$$

which has solutions

$$\lambda_i = s_{ii}, \qquad i = 1, 2, \ldots, p. \tag{12.17}$$

The corresponding normalized eigenvectors have a 1 in the ith position and 0's
elsewhere:

$$\mathbf{a}_i' = (0, \ldots, 0, 1, 0, \ldots, 0). \tag{12.18}$$

Thus the ith component is

$$z_i = \mathbf{a}_i'\mathbf{y} = y_i.$$

In  practice,  the  sample  correlations  (of  continuous  random  variables)  will  not  be  zero, 
but  if  the  correlations  are  all  small,  the  principal  components  will  largely  duplicate 
the  variables. 
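The diagonal case can be checked numerically. The following is a minimal sketch in Python with NumPy, assuming a hypothetical set of four variances; the values are illustrative only and do not come from any data set in this book.

```python
import numpy as np

# Hypothetical variances for p = 4 uncorrelated variables (illustrative values only).
variances = np.array([9.0, 4.0, 2.5, 1.0])
S = np.diag(variances)

# eigh returns eigenvalues in ascending order; reverse to get the usual ordering.
eigenvalues, eigenvectors = np.linalg.eigh(S)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Each eigenvector should have one entry equal to 1 in absolute value and 0's elsewhere,
# so each component z_i = a_i'y reproduces one original variable (up to sign).
is_unit_vector = np.isclose(np.abs(eigenvectors).max(axis=0), 1.0)
```

Here `eigenvalues` reproduces the variances and the matrix of absolute eigenvector entries is the identity, in agreement with (12.17) and (12.18).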

By the Perron-Frobenius theorem in Section 2.11.4, if all correlations or covariances
are positive, all elements of the first eigenvector $\mathbf{a}_1$ are positive. Since the
remaining eigenvectors $\mathbf{a}_2, \mathbf{a}_3, \ldots, \mathbf{a}_p$ are orthogonal to $\mathbf{a}_1$, they must have both
positive and negative elements. When all elements of $\mathbf{a}_1$ are positive, the first
component is a weighted average of the variables and is sometimes referred to as a measure
of size. Likewise, the positive and negative coefficients in subsequent components
may be regarded as defining shape. This pattern is often seen when the variables are
various measurements of an organism.
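The size and shape pattern can be illustrated with a small numerical sketch. The covariance matrix below is hypothetical (all entries positive, chosen only for illustration), and the sign convention applied to the first eigenvector is our own choice, since the sign of an eigenvector is arbitrary.

```python
import numpy as np

# Hypothetical covariance matrix with all positive entries (illustrative values only).
S = np.array([
    [4.0, 2.0, 1.5],
    [2.0, 3.0, 1.0],
    [1.5, 1.0, 2.0],
])

eigenvalues, eigenvectors = np.linalg.eigh(S)
order = np.argsort(eigenvalues)[::-1]
A = eigenvectors[:, order]

# Orient a1 so that its largest-magnitude entry is positive.
a1 = A[:, 0] * np.sign(A[np.argmax(np.abs(A[:, 0])), 0])

first_all_positive = bool(np.all(a1 > 0))
# Every later eigenvector is orthogonal to a1, so it must have entries of both signs.
others_mixed_sign = [bool(A[:, j].min() < 0 < A[:, j].max())
                     for j in range(1, A.shape[1])]
```

With this matrix, `first_all_positive` is true and every entry of `others_mixed_sign` is true, which is the pattern described above.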


Example 12.8.1. In the modified football data of Example 12.2.2, there are a few
negative covariances in S, but they are small, and all elements of the first
eigenvector remain positive. The second eigenvector therefore has positive and negative
elements:


First Two Eigenvectors

Variable      a1       a2
WDIM         .207    -.142
CIRCUM       .873    -.219
FBEYE        .261    -.231
EYEHD        .326     .891
EARHD        .066     .222
JAW          .128    -.187

With all positive coefficients, the first component $z_1$ is an overall measure of head
size ($z_1$ increases if all six variables increase). The second component $z_2$ is a shape
component that contrasts the vertical measurements EYEHD and EARHD with the
three lateral measurements and CIRCUM ($z_2$ increases if EYEHD and EARHD
increase and the other four variables decrease). □


12.8.2  Rotation 

The  principal  components  are  initially  obtained  by  rotating  axes  in  order  to  line 
up  with  the  natural  extensions  of  the  system,  whereupon  the  new  variables  become 
uncorrelated and reflect the directions of maximum variance. If the resulting
components do not have a satisfactory interpretation, they can be further rotated, seeking
dimensions  in  which  many  of  the  coefficients  of  the  linear  combinations  are  near 
zero  to  simplify  interpretation. 

However,  the  new  rotated  components  are  correlated,  and  they  do  not  successively 
account  for  maximum  variance.  They  are,  therefore,  no  longer  principal  components 
in  the  usual  sense,  and  their  routine  use  is  questionable.  For  improved  interpretation, 
you  may  wish  to  try  factor  analysis  (Chapter  13),  in  which  rotation  does  not  destroy 
any properties. (In factor analysis, the rotation does not involve the space of the
variables $y_1, y_2, \ldots, y_p$, but another space, that of the factor loadings.)
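The loss of these properties under rotation can be seen in a small numerical sketch. We assume a hypothetical covariance matrix and an arbitrary rotation angle of 30 degrees applied to the first two components; neither is taken from the text.

```python
import numpy as np

# Hypothetical sample covariance matrix (illustrative values only).
S = np.array([
    [4.0, 2.0, 1.5],
    [2.0, 3.0, 1.0],
    [1.5, 1.0, 2.0],
])
eigenvalues, eigenvectors = np.linalg.eigh(S)
order = np.argsort(eigenvalues)[::-1]
lam = eigenvalues[order]
A = eigenvectors[:, order][:, :2]      # coefficients of the first two components

# Covariance matrix of z = A'y is A'SA = diag(lambda_1, lambda_2).
cov_z = A.T @ S @ A

# Rotate the two retained components through an arbitrary angle.
theta = np.deg2rad(30.0)
T = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
A_rot = A @ T
cov_z_rot = A_rot.T @ S @ A_rot        # equals T' diag(lambda_1, lambda_2) T
```

The off-diagonal element of `cov_z` is zero, but that of `cov_z_rot` is not, and the first rotated component has variance smaller than the largest eigenvalue. The total variance of the pair (the trace) is unchanged by the rotation.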

12.8.3  Correlations  between  Variables  and  Principal  Components 

The use of correlations between variables and principal components is widely
recommended as an aid to interpretation. It was noted in Sections 8.7.3 and 11.5.2 that
analogous correlations for discriminant functions and canonical variates are not
useful in a multivariate context because they provide only univariate information about
how each variable operates by itself, ignoring the other variables. Rencher (1992b)
obtained a similar result for principal components.

We denote the correlation between the $i$th variable $y_i$ and the $j$th principal
component $z_j$ by $r_{y_i z_j}$. Because of the orthogonality of the $z_j$'s, we have the simple
relationship

\[
r^2_{y_i z_1} + r^2_{y_i z_2} + \cdots + r^2_{y_i z_k} = R^2_{y_i|z_1,\ldots,z_k}, \tag{12.19}
\]

where $k$ is the number of components retained and $R^2_{y_i|z_1,\ldots,z_k}$ is the squared multiple
correlation of $y_i$ with the $z_j$'s. Thus $r^2_{y_i z_j}$ forms part of $R^2_{y_i|z_1,\ldots,z_k}$, which shows how
$y_i$ relates to the $z$'s by itself, not what it contributes in the presence of the other $y$'s.
The correlations are, therefore, not informative about the joint contribution of the $y$'s
in a principal component.

Note that the simple partitioning of $R^2$ into the sum of squares of correlations in
(12.19) does not happen in practice when the independent variables ($x$'s) are
correlated. However, here the $z$'s are principal components and are, therefore, orthogonal.
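The relationship in (12.19) can be verified numerically. The sketch below uses simulated data (a hypothetical sample of 200 observations on four correlated variables, generated with a fixed seed) and compares the sum of squared correlations with the squared multiple correlation obtained by regressing each variable on the first two components.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: n = 200 observations on p = 4 correlated variables.
n, p, k = 200, 4, 2
Y = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))

S = np.cov(Y, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(S)
order = np.argsort(eigenvalues)[::-1]
A = eigenvectors[:, order][:, :k]
Yc = Y - Y.mean(axis=0)
Z = Yc @ A                                  # scores on the first k components

# Correlations r_{y_i z_j}, arranged as a p x k array.
corr = np.corrcoef(np.hstack([Y, Z]), rowvar=False)[:p, p:]
sum_sq_corr = (corr ** 2).sum(axis=1)

# Squared multiple correlation of each y_i regressed on z_1, ..., z_k.
coef, *_ = np.linalg.lstsq(Z, Yc, rcond=None)
resid = Yc - Z @ coef
r_squared = 1.0 - (resid ** 2).sum(axis=0) / (Yc ** 2).sum(axis=0)
```

The arrays `sum_sq_corr` and `r_squared` agree to rounding error because the component scores are uncorrelated in the sample. With correlated predictors the two quantities would generally differ.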

Since  we  do  not  recommend  rotation  or  correlations  for  interpretation,  we  are  left 
with  the  coefficients  themselves,  obtained  from  the  eigenvectors  of  either  S  or  R. 

Example 12.8.3. In Example 12.8.1, the eigenvectors of S from the modified football
data gave a satisfactory interpretation of the first two principal components as
head size and shape. We give these in Table 12.5, along with the correlations between
each of the variables $y_1, y_2, \ldots, y_6$ and the first two principal components $z_1$ and
$z_2$. For comparison we also give $R^2_{y_i|z_1,z_2}$ for each variable.

The  correlations  rank  the  variables  somewhat  differently  in  their  contribution  to 
the  components,  since  they  form  part  of  the  univariate  information  provided  by  R2 
for  each  variable  by  itself.  For  example,  for  the  first  component,  the  correlations  rank 
the  variables  in  the  order  2,  3,  1,  4,  6,  5,  whereas  the  coefficients  (first  eigenvector) 
from  S  rank  them  in  the  order  2,  4,  3,  1,  6,  5.  □ 


Table 12.5. Eigenvectors Obtained from S, Correlations between Variables and Principal
Components, and $R^2$ for the First Two Principal Components

            Eigenvectors from S        Correlations
Variable      a1        a2       r_{y_i z_1}   r_{y_i z_2}   R^2_{y_i|z_1,z_2}
   1          .21      -.14         .62          -.27             .46
   2          .87      -.22         .98          -.16             .99
   3          .26      -.23         .70          -.40             .66
   4          .33       .89         .49           .86             .98
   5          .07       .22         .17           .37             .17
   6          .13      -.19         .41          -.39             .32


12.9 SELECTION OF VARIABLES

We have previously discussed subset selection in connection with Wilks’ $\Lambda$
(Section 6.11.2), discriminant analysis (Section 8.9), classification analysis (Section 9.6),
and regression (Sections 10.2.7 and 10.7). In each case the criterion for selection of
variables was the relationship of the variables to some external factor, such as
dependent variable(s), separation of groups, or correct classification rates. In the context of
principal components, we have no dependent variable, as in regression, and no
groupings among the observations, as in discriminant analysis. With no external influence,
we simply wish to find the subset that best captures the internal variation (and
covariation) of the variables.

Jolliffe  (1972,  1973)  discussed  eight  selection  methods  and  referred  to  the  process 
as  discarding  variables.  The  eight  methods  were  based  on  three  basic  approaches: 
multiple  correlation,  clustering  of  variables,  and  principal  components.  One  of  the 
correlation  methods,  for  example,  proceeds  in  a  stepwise  fashion,  deleting  at  each 
step  the  variable  that  has  the  largest  multiple  correlation  with  the  other  variables.  The 
clustering  methods  partition  the  variables  into  groups  or  clusters  (see  Chapter  14)  and 
select  a  variable  from  each  cluster. 

We describe Jolliffe’s principal component methods in the context of selecting a
subset of 10 variables out of 50 variables. One of his techniques associates a
variable with each of the first 10 principal components and retains these 10 variables.
Another approach is to associate a variable with each of the last 40 principal
components and delete the 40 variables. To associate a variable with a principal component,
we choose the variable corresponding to the largest coefficient (in absolute value) in
the component, providing the variable has not previously been selected. We can use
components extracted from either S or R. For example, in the two principal
components for the football data in Example 12.2.2, we would choose variables 2 and 4,
which clearly have the largest coefficients in the two components. Jolliffe’s methods
could also be applied iteratively, with the principal components being recomputed
after a variable is retained or deleted.
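The association rule can be sketched in a few lines of Python. The function name `associate_variables` is hypothetical, and the sketch covers only the noniterative version of the rule (components are not recomputed after each choice). The coefficient matrix is the pair of eigenvectors reported in Example 12.8.1.

```python
import numpy as np

def associate_variables(A):
    """For each column (component) of A, pick the variable with the largest
    absolute coefficient that has not already been picked."""
    chosen = []
    for j in range(A.shape[1]):
        # Scan the variables in decreasing order of |coefficient| in component j.
        for i in np.argsort(-np.abs(A[:, j])):
            if int(i) not in chosen:
                chosen.append(int(i))
                break
    return chosen

# First two eigenvectors of S from Example 12.8.1
# (rows: WDIM, CIRCUM, FBEYE, EYEHD, EARHD, JAW).
A = np.array([
    [.207, -.142],
    [.873, -.219],
    [.261, -.231],
    [.326,  .891],
    [.066,  .222],
    [.128, -.187],
])
kept = [i + 1 for i in associate_variables(A)]   # 1-based variable numbers
```

For these two eigenvectors, `kept` contains variables 2 and 4 (CIRCUM and EYEHD), matching the choice described above. To delete variables instead, the same function could be applied to the columns for the last components.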

Jolliffe  (1972)  compared  the  eight  methods  using  both  real  and  simulated  data  and 
found  that  the  methods  based  on  principal  components  performed  well  in  comparison 
to  the  regression  and  cluster-based  methods.  But  he  concluded  that  no  single  method 
was  uniformly  best. 

McCabe  (1984)  suggested  several  criteria  for  selection,  most  of  which  are  based 
on  the  conditional  covariance  matrix  of  the  variables  not  selected,  given  those 
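selected. He denoted the selected variables as principal variables. Let $\mathbf{y}$ be
partitioned as

\[
\mathbf{y} = \begin{pmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \end{pmatrix},
\]

where $\mathbf{y}_1$ contains the selected variables and $\mathbf{y}_2$ consists of the variables not selected.
The corresponding covariance matrix is

\[
\operatorname{cov}(\mathbf{y}) = \boldsymbol{\Sigma} =
\begin{pmatrix}
\boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\
\boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22}
\end{pmatrix}.
\]

By (4.8), the conditional covariance matrix is given by (assuming normality)

\[
\operatorname{cov}(\mathbf{y}_2|\mathbf{y}_1) = \boldsymbol{\Sigma}_{22} - \boldsymbol{\Sigma}_{21}\boldsymbol{\Sigma}_{11}^{-1}\boldsymbol{\Sigma}_{12},
\]

which is estimated by $\mathbf{S}_{22} - \mathbf{S}_{21}\mathbf{S}_{11}^{-1}\mathbf{S}_{12}$. To find a subset $\mathbf{y}_1$ of size $m$, two of
McCabe’s criteria are to choose the subset $\mathbf{y}_1$ that

1. minimizes $|\mathbf{S}_{22} - \mathbf{S}_{21}\mathbf{S}_{11}^{-1}\mathbf{S}_{12}|$ and

2. maximizes $\sum_{i=1}^{m^*} r_i^2$, where $r_i$, $i = 1, 2, \ldots, m^* = \min(m, p - m)$, are the
canonical correlations between the $m$ selected variables in $\mathbf{y}_1$ and the $p - m$
deleted variables in $\mathbf{y}_2$.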


Ideally,  these  criteria  would  be  evaluated  for  all  possible  subsets  so  as  to  obtain 
the  best  subset  of  each  size.  McCabe  suggested  a  regression  approach  for  obtaining 
a  percent  of  variance  explained  by  a  subset  of  variables  to  be  compared  with  the 
percent  of  variance  accounted  for  by  the  same  number  of  principal  components. 
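The first of these criteria can be sketched with an exhaustive search, which is practical only when $p$ is small. The function names and the covariance matrix below are hypothetical and serve only to illustrate the computation; this is not McCabe's own algorithm.

```python
import numpy as np
from itertools import combinations

def conditional_cov(S, keep):
    """Return S22 - S21 S11^{-1} S12, where block 1 holds the variables in `keep`."""
    keep = list(keep)
    drop = [i for i in range(S.shape[0]) if i not in keep]
    S11 = S[np.ix_(keep, keep)]
    S12 = S[np.ix_(keep, drop)]
    S22 = S[np.ix_(drop, drop)]
    return S22 - S12.T @ np.linalg.solve(S11, S12)

def best_subset_by_determinant(S, m):
    """Search all subsets of size m (m < p) for the smallest conditional determinant."""
    p = S.shape[0]
    return min(combinations(range(p), m),
               key=lambda keep: np.linalg.det(conditional_cov(S, keep)))

# Hypothetical covariance matrix for p = 3 variables (illustrative values only).
S = np.array([
    [4.0, 2.0, 1.5],
    [2.0, 3.0, 1.0],
    [1.5, 1.0, 2.0],
])
best_one = best_subset_by_determinant(S, 1)
best_two = best_subset_by_determinant(S, 2)
```

Because $|\mathbf{S}| = |\mathbf{S}_{11}|\,|\mathbf{S}_{22} - \mathbf{S}_{21}\mathbf{S}_{11}^{-1}\mathbf{S}_{12}|$, minimizing the conditional determinant is equivalent to maximizing $|\mathbf{S}_{11}|$, which gives a quick check on the search. Indices in the returned tuples are 0-based.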


PROBLEMS 

12.1 Show that the solutions to $\lambda = \mathbf{a}'\mathbf{S}\mathbf{a}/\mathbf{a}'\mathbf{a}$ in (12.7) are given by the eigenvalues
and eigenvectors in (12.8), so that $\lambda$ in (12.7) is maximized by the largest
eigenvalue of S.

12.2 Show that the eigenvalues of

\[
\begin{pmatrix} 1 & r \\ r & 1 \end{pmatrix}
\]

are $1 \pm r$, as in (12.13), and that the eigenvectors are as given in (12.14).

12.3 (a) Give a justification based on the likelihood ratio for the test statistic $u$ in
(12.15).

(b) Give a justification for the degrees of freedom $\nu = \frac{1}{2}(k - 1)(k + 2)$ for
the test statistic in (12.15).

12.4 Show that when S is diagonal as in (12.16), the eigenvectors have the form
$\mathbf{a}_i' = (0, \ldots, 0, 1, 0, \ldots, 0)$, as given in (12.18).

12.5 Show that $r^2_{y_i z_1} + r^2_{y_i z_2} + \cdots + r^2_{y_i z_k} = R^2_{y_i|z_1,\ldots,z_k}$, as in (12.19).

12.6  Carry  out  a  principal  component  analysis  of  the  diabetes  data  of  Table  3.4. 
Use  all  five  variables,  including  y’s  and  x’s.  Use  both  S  and  R.  Which  do 
you think is more appropriate here? Show the percent of variance explained.
Based on the average eigenvalue or a scree plot, decide how many components
to retain. Can you interpret the components of either S or R?

12.7  Do  a  principal  component  analysis  of  the  probe  word  data  of  Table  3.5.  Use 
both  S  and  R.  Which  do  you  think  is  more  appropriate  here?  Show  the  percent 
of  variance  explained.  Based  on  the  average  eigenvalue  or  a  scree  plot,  decide 
how many components to retain. Can you interpret the components of either
S or R?

12.8  Carry  out  a  principal  component  analysis  on  all  six  variables  of  the  glucose 
data of Table 3.8. Use both S and R. Which do you think is more appropriate
here? Show the percent of variance explained. Based on the average
eigenvalue  or  a  scree  plot,  decide  how  many  components  to  retain.  Can  you 
interpret  the  components  of  either  S  or  R? 

12.9 Carry out a principal component analysis on the hematology data of Table 4.3.
Use both S and R. Which do you think is more appropriate here? Show the
percent of variance explained. Based on the average eigenvalue or a scree
plot, decide how many components to retain. Can you interpret the components
of either S or R? Does the large variance of yi affect the pattern of the
components of S?

12.10  Carry  out  a  principal  component  analysis  separately  for  males  and  females  in 
the  psychological  data  of  Table  5.1.  Compare  the  results  for  the  two  groups. 
Use  S. 

12.11  Carry  out  a  principal  component  analysis  separately  for  the  two  species  in  the 
beetle  data  of  Table  5.5.  Compare  the  results  for  the  two  groups.  Use  S. 

12.12  Carry  out  a  principal  component  analysis  on  the  engineer  data  of  Table  5.6  as 
follows: 

(a)  Use  the  pooled  covariance  matrix. 

(b)  Ignore  groups  and  use  a  covariance  matrix  based  on  all  40  observations. 

(c)  Which  of  the  approaches  in  (a)  or  (b)  appears  to  be  more  successful? 

12.13  Repeat  the  previous  problem  for  the  dystrophy  data  of  Table  5.7. 

12.14  Carry  out  a  principal  component  analysis  on  all  10  variables  of  the  Seishu  data 
of  Table  7.1.  Use  both  S  and  R.  Which  do  you  think  is  more  appropriate  here? 
Show  the  percent  of  variance  explained.  Based  on  the  average  eigenvalue  or 
a  scree  plot,  decide  how  many  components  to  retain.  Can  you  interpret  the 
components  of  either  S  or  R? 

12.15  Carry  out  a  principal  component  analysis  on  the  temperature  data  of  Table  7.2. 
Use  both  S  and  R.  Which  do  you  think  is  more  appropriate  here?  Show  the 
percent  of  variance  explained.  Based  on  the  average  eigenvalue  or  a  scree  plot, 
decide  how  many  components  to  retain.  Can  you  interpret  the  components  of 
either  S  or  R? 


CHAPTER  13 


Factor  Analysis 


13.1  INTRODUCTION 

In factor analysis we represent the variables $y_1, y_2, \ldots, y_p$ as linear combinations
of a few random variables $f_1, f_2, \ldots, f_m$ ($m < p$) called factors. The factors are
underlying constructs or latent variables that “generate” the $y$’s. Like the original
variables,  the  factors  vary  from  individual  to  individual;  but  unlike  the  variables,  the 
factors  cannot  be  measured  or  observed.  The  existence  of  these  hypothetical  variables 
is  therefore  open  to  question. 

If the original variables $y_1, y_2, \ldots, y_p$ are at least moderately correlated, the basic
dimensionality  of  the  system  is  less  than  p.  The  goal  of  factor  analysis  is  to  reduce 
the  redundancy  among  the  variables  by  using  a  smaller  number  of  factors. 

Suppose the pattern of the high and low correlations in the correlation matrix is
such that the variables in a particular subset have high correlations among themselves
but low correlations with all the other variables. Then there may be a single
underlying factor that gave rise to the variables in the subset. If the other variables
can be similarly grouped into subsets with a like pattern of correlations, then a few
factors can represent these groups of variables. In this case the pattern in the
correlation matrix corresponds directly to the factors. For example, suppose the correlation
matrix has the form


\[
\begin{pmatrix}
1.00 & .90 & .05 & .05 & .05 \\
.90 & 1.00 & .05 & .05 & .05 \\
.05 & .05 & 1.00 & .90 & .90 \\
.05 & .05 & .90 & 1.00 & .90 \\
.05 & .05 & .90 & .90 & 1.00
\end{pmatrix}.
\]

Then  variables  1  and  2  correspond  to  a  factor,  and  variables  3,  4,  and  5  correspond 
to  another  factor.  In  some  cases  where  the  correlation  matrix  does  not  have  such  a 
simple  pattern,  factor  analysis  will  still  partition  the  variables  into  clusters. 
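As a rough numerical illustration of this pattern, the sketch below examines the eigenvalues and leading eigenvectors of the correlation matrix displayed above. This is only a diagnostic based on an eigen-decomposition, not a factor analysis estimate; the rule of assigning each variable to the eigenvector with the larger absolute coefficient is our own simplification.

```python
import numpy as np

R = np.array([
    [1.00, 0.90, 0.05, 0.05, 0.05],
    [0.90, 1.00, 0.05, 0.05, 0.05],
    [0.05, 0.05, 1.00, 0.90, 0.90],
    [0.05, 0.05, 0.90, 1.00, 0.90],
    [0.05, 0.05, 0.90, 0.90, 1.00],
])
eigenvalues, eigenvectors = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# For a correlation matrix the average eigenvalue is 1, so count eigenvalues above 1.
n_large = int((eigenvalues > 1.0).sum())

# Assign each variable to the leading eigenvector on which it has the larger
# absolute coefficient.
group = np.argmax(np.abs(eigenvectors[:, :n_large]), axis=1)
```

Two eigenvalues exceed 1 and the remaining three equal .10. The array `group` places variables 1 and 2 together and variables 3, 4, and 5 together, mirroring the two blocks of high correlations.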

Factor analysis is related to principal component analysis in that both seek a simpler
structure in a set of variables but they differ in many respects (see Section 13.8).
For example, two differences in basic approach are as follows:

1. Principal components are defined as linear combinations of the original variables.
In factor analysis, the original variables are expressed as linear combinations
of the factors.

2. In principal component analysis, we explain a large part of the total variance of
the variables, $\sum_i s_{ii}$. In factor analysis, we seek to account for the covariances
or correlations among the variables.

In  practice,  there  are  some  data  sets  for  which  the  factor  analysis  model  does  not 
provide  a  satisfactory  fit.  Thus,  factor  analysis  remains  somewhat  subjective  in  many 
applications,  and  it  is  considered  controversial  by  some  statisticians.  Sometimes  a 
few  easily  interpretable  factors  emerge,  but  for  other  data  sets,  neither  the  number 
of  factors  nor  the  interpretation  is  clear.  Some  possible  reasons  for  these  failures  are 
discussed  in  Section  13.7. 


13.2  ORTHOGONAL  FACTOR  MODEL 
13.2.1  Model  Definition  and  Assumptions 

Factor analysis is basically a one-sample procedure [for possible applications to data
with groups, see Rencher (1998, Section 10.8)]. We assume a random sample $\mathbf{y}_1,
\mathbf{y}_2, \ldots, \mathbf{y}_n$ from a homogeneous population with mean vector $\boldsymbol{\mu}$ and covariance
matrix $\boldsymbol{\Sigma}$.

The factor analysis model expresses each variable as a linear combination of
underlying common factors $f_1, f_2, \ldots, f_m$, with an accompanying error term to
account for that part of the variable that is unique (not in common with the other
variables). For $y_1, y_2, \ldots, y_p$ in any observation vector $\mathbf{y}$, the model is as follows:

\[
\begin{aligned}
y_1 - \mu_1 &= \lambda_{11} f_1 + \lambda_{12} f_2 + \cdots + \lambda_{1m} f_m + \varepsilon_1 \\
y_2 - \mu_2 &= \lambda_{21} f_1 + \lambda_{22} f_2 + \cdots + \lambda_{2m} f_m + \varepsilon_2 \\
&\;\;\vdots \\
y_p - \mu_p &= \lambda_{p1} f_1 + \lambda_{p2} f_2 + \cdots + \lambda_{pm} f_m + \varepsilon_p.
\end{aligned}
\tag{13.1}
\]

Ideally, $m$ should be substantially smaller than $p$; otherwise we have not achieved a
parsimonious description of the variables as functions of a few underlying factors.
We might regard the $f$’s in (13.1) as random variables that engender the $y$’s. The
coefficients $\lambda_{ij}$ are called loadings and serve as weights, showing how each $y_i$
individually depends on the $f$’s. (In this chapter, we defer to common usage in the factor
analysis literature and use the notation $\lambda_{ij}$ for loadings rather than eigenvalues.) With
appropriate assumptions, $\lambda_{ij}$ indicates the importance of the $j$th factor $f_j$ to the $i$th
variable $y_i$ and can be used in interpretation of $f_j$. We describe or interpret $f_2$, for
example, by examining its coefficients, $\lambda_{12}, \lambda_{22}, \ldots, \lambda_{p2}$. The larger loadings relate
$f_2$ to the corresponding $y$’s. From these $y$’s, we infer a meaning or description of $f_2$.
After estimating the $\lambda_{ij}$’s (and rotating them; see Sections 13.2.2 and 13.5), it is
hoped they will partition the variables into groups corresponding to factors.

The system of equations (13.1) bears a superficial resemblance to the multiple
regression model (10.1), but there are fundamental differences. For example, (1) the
$f$’s are unobserved and (2) the model in (13.1) represents only one observation
vector, whereas (10.1) depicts all $n$ observations.

It is assumed that for $j = 1, 2, \ldots, m$, $E(f_j) = 0$, $\operatorname{var}(f_j) = 1$, and
$\operatorname{cov}(f_j, f_k) = 0$, $j \neq k$. The assumptions for $\varepsilon_i$, $i = 1, 2, \ldots, p$, are similar,
except that we must allow each $\varepsilon_i$ to have a different variance, since it shows the
residual part of $y_i$ that is not in common with the other variables. Thus we assume
that $E(\varepsilon_i) = 0$, $\operatorname{var}(\varepsilon_i) = \psi_i$, and $\operatorname{cov}(\varepsilon_i, \varepsilon_k) = 0$, $i \neq k$. In addition, we assume
that $\operatorname{cov}(\varepsilon_i, f_j) = 0$ for all $i$ and $j$. We refer to $\psi_i$ as the specific variance.

These assumptions are natural consequences of the basic model (13.1) and the
goals of factor analysis. Since $E(y_i - \mu_i) = 0$, we need $E(f_j) = 0$, $j = 1, 2, \ldots, m$.
The assumption $\operatorname{cov}(f_j, f_k) = 0$ is made for parsimony in expressing the $y$’s as
functions of as few factors as possible. The assumptions $\operatorname{var}(f_j) = 1$, $\operatorname{var}(\varepsilon_i) = \psi_i$,
$\operatorname{cov}(f_j, f_k) = 0$, and $\operatorname{cov}(\varepsilon_i, f_j) = 0$ yield a simple expression for the variance
of $y_i$,

\[
\operatorname{var}(y_i) = \lambda_{i1}^2 + \lambda_{i2}^2 + \cdots + \lambda_{im}^2 + \psi_i, \tag{13.2}
\]

which plays an important role in our development. Note that the assumption
$\operatorname{cov}(\varepsilon_i, \varepsilon_k) = 0$ implies that the factors account for all the correlations among
the $y$’s, that is, all that the $y$’s have in common. Thus the emphasis in factor analysis
is on modeling the covariances or correlations among the $y$’s.

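Model (13.1) can be written in matrix notation as

\[
\mathbf{y} - \boldsymbol{\mu} = \boldsymbol{\Lambda}\mathbf{f} + \boldsymbol{\varepsilon}, \tag{13.3}
\]

where $\mathbf{y} = (y_1, y_2, \ldots, y_p)'$, $\boldsymbol{\mu} = (\mu_1, \mu_2, \ldots, \mu_p)'$, $\mathbf{f} = (f_1, f_2, \ldots, f_m)'$,
$\boldsymbol{\varepsilon} = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_p)'$, and

\[
\boldsymbol{\Lambda} =
\begin{pmatrix}
\lambda_{11} & \lambda_{12} & \cdots & \lambda_{1m} \\
\lambda_{21} & \lambda_{22} & \cdots & \lambda_{2m} \\
\vdots & \vdots & & \vdots \\
\lambda_{p1} & \lambda_{p2} & \cdots & \lambda_{pm}
\end{pmatrix}.
\tag{13.4}
\]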


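We illustrate the model in (13.1) and (13.3) with $p = 5$ and $m = 2$. The model for
each variable in (13.1) becomes

\[
\begin{aligned}
y_1 - \mu_1 &= \lambda_{11} f_1 + \lambda_{12} f_2 + \varepsilon_1 \\
y_2 - \mu_2 &= \lambda_{21} f_1 + \lambda_{22} f_2 + \varepsilon_2 \\
y_3 - \mu_3 &= \lambda_{31} f_1 + \lambda_{32} f_2 + \varepsilon_3 \\
y_4 - \mu_4 &= \lambda_{41} f_1 + \lambda_{42} f_2 + \varepsilon_4 \\
y_5 - \mu_5 &= \lambda_{51} f_1 + \lambda_{52} f_2 + \varepsilon_5.
\end{aligned}
\]

In matrix notation as in (13.3), this becomes

\[
\begin{pmatrix}
y_1 - \mu_1 \\ y_2 - \mu_2 \\ y_3 - \mu_3 \\ y_4 - \mu_4 \\ y_5 - \mu_5
\end{pmatrix}
=
\begin{pmatrix}
\lambda_{11} & \lambda_{12} \\
\lambda_{21} & \lambda_{22} \\
\lambda_{31} & \lambda_{32} \\
\lambda_{41} & \lambda_{42} \\
\lambda_{51} & \lambda_{52}
\end{pmatrix}
\begin{pmatrix} f_1 \\ f_2 \end{pmatrix}
+
\begin{pmatrix}
\varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \varepsilon_4 \\ \varepsilon_5
\end{pmatrix},
\tag{13.5}
\]

or $\mathbf{y} - \boldsymbol{\mu} = \boldsymbol{\Lambda}\mathbf{f} + \boldsymbol{\varepsilon}$.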

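The assumptions listed between (13.1) and (13.2) can be expressed concisely
using vector and matrix notation: $E(f_j) = 0$, $j = 1, 2, \ldots, m$, becomes

\[
E(\mathbf{f}) = \mathbf{0}, \tag{13.6}
\]

$\operatorname{var}(f_j) = 1$, $j = 1, 2, \ldots, m$, and $\operatorname{cov}(f_j, f_k) = 0$, $j \neq k$, become

\[
\operatorname{cov}(\mathbf{f}) = \mathbf{I}, \tag{13.7}
\]

$E(\varepsilon_i) = 0$, $i = 1, 2, \ldots, p$, becomes

\[
E(\boldsymbol{\varepsilon}) = \mathbf{0}, \tag{13.8}
\]

$\operatorname{var}(\varepsilon_i) = \psi_i$, $i = 1, 2, \ldots, p$, and $\operatorname{cov}(\varepsilon_i, \varepsilon_k) = 0$, $i \neq k$, become

\[
\operatorname{cov}(\boldsymbol{\varepsilon}) = \boldsymbol{\Psi} =
\begin{pmatrix}
\psi_1 & 0 & \cdots & 0 \\
0 & \psi_2 & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & \psi_p
\end{pmatrix},
\tag{13.9}
\]

and $\operatorname{cov}(\varepsilon_i, f_j) = 0$ for all $i$ and $j$ becomes

\[
\operatorname{cov}(\mathbf{f}, \boldsymbol{\varepsilon}) = \mathbf{O}. \tag{13.10}
\]

The notation $\operatorname{cov}(\mathbf{f}, \boldsymbol{\varepsilon})$ indicates a rectangular matrix containing the covariances of
the $f$’s with the $\varepsilon$’s:

\[
\operatorname{cov}(\mathbf{f}, \boldsymbol{\varepsilon}) =
\begin{pmatrix}
\sigma_{f_1\varepsilon_1} & \sigma_{f_1\varepsilon_2} & \cdots & \sigma_{f_1\varepsilon_p} \\
\sigma_{f_2\varepsilon_1} & \sigma_{f_2\varepsilon_2} & \cdots & \sigma_{f_2\varepsilon_p} \\
\vdots & \vdots & & \vdots \\
\sigma_{f_m\varepsilon_1} & \sigma_{f_m\varepsilon_2} & \cdots & \sigma_{f_m\varepsilon_p}
\end{pmatrix}.
\]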

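It was noted following (13.2) that the emphasis in factor analysis is on modeling
the covariances among the $y$’s. We wish to express the $\frac{1}{2}p(p - 1)$ covariances (and the
$p$ variances) of the variables $y_1, y_2, \ldots, y_p$ in terms of a simplified structure
involving the $pm$ loadings $\lambda_{ij}$ and the $p$ specific variances $\psi_i$; that is, we wish to express
$\boldsymbol{\Sigma}$ in terms of $\boldsymbol{\Lambda}$ and $\boldsymbol{\Psi}$. We can do this using the model (13.3) and the assumptions
(13.7), (13.9), and (13.10). Since $\boldsymbol{\mu}$ does not affect variances and covariances of $\mathbf{y}$,
we have, from (13.3),

\[
\boldsymbol{\Sigma} = \operatorname{cov}(\mathbf{y}) = \operatorname{cov}(\boldsymbol{\Lambda}\mathbf{f} + \boldsymbol{\varepsilon}).
\]

By (13.10), $\boldsymbol{\Lambda}\mathbf{f}$ and $\boldsymbol{\varepsilon}$ are uncorrelated; therefore, the covariance matrix of their
sum is the sum of their covariance matrices:

\[
\begin{aligned}
\boldsymbol{\Sigma} &= \operatorname{cov}(\boldsymbol{\Lambda}\mathbf{f}) + \operatorname{cov}(\boldsymbol{\varepsilon}) \\
&= \boldsymbol{\Lambda}\operatorname{cov}(\mathbf{f})\boldsymbol{\Lambda}' + \boldsymbol{\Psi} \quad \text{[by (3.74) and (13.9)]} \\
&= \boldsymbol{\Lambda}\mathbf{I}\boldsymbol{\Lambda}' + \boldsymbol{\Psi} \quad \text{[by (13.7)]} \\
&= \boldsymbol{\Lambda}\boldsymbol{\Lambda}' + \boldsymbol{\Psi}.
\end{aligned}
\tag{13.11}
\]

If $\boldsymbol{\Lambda}$ has only a few columns, say two or three, then $\boldsymbol{\Sigma} = \boldsymbol{\Lambda}\boldsymbol{\Lambda}' + \boldsymbol{\Psi}$ in (13.11)
represents a simplified structure for $\boldsymbol{\Sigma}$, in which the covariances are modeled by the
$\lambda_{ij}$’s alone since $\boldsymbol{\Psi}$ is diagonal. For example, in the illustration in (13.5) with $m = 2$
factors, $\sigma_{12}$ would be the product of the first two rows of $\boldsymbol{\Lambda}$, that is,

\[
\sigma_{12} = \operatorname{cov}(y_1, y_2) = \lambda_{11}\lambda_{21} + \lambda_{12}\lambda_{22},
\]

where $(\lambda_{11}, \lambda_{12})$ is the first row of $\boldsymbol{\Lambda}$ and $(\lambda_{21}, \lambda_{22})$ is the second row of $\boldsymbol{\Lambda}$. If $y_1$
and $y_2$ have a great deal in common, they will have similar loadings on the common
factors $f_1$ and $f_2$; that is, $(\lambda_{11}, \lambda_{12})$ will be similar to $(\lambda_{21}, \lambda_{22})$. In this case, either
$\lambda_{11}\lambda_{21}$ or $\lambda_{12}\lambda_{22}$ is likely to be high. On the other hand, if $y_1$ and $y_2$ have little in
common, then their loadings $\lambda_{11}$ and $\lambda_{21}$ on $f_1$ will be different and their loadings
$\lambda_{12}$ and $\lambda_{22}$ on $f_2$ will likewise differ. In this case, the products $\lambda_{11}\lambda_{21}$ and $\lambda_{12}\lambda_{22}$
will tend to be small.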
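The covariance structure in (13.11) can be illustrated numerically. The sketch below assumes hypothetical loadings and specific variances for $p = 5$ and $m = 2$ (chosen so that each variable has unit variance), builds the implied covariance matrix, and then simulates from the model with normally distributed factors and errors and zero mean vector. Normality is an assumption of the simulation only.

```python
import numpy as np

# Hypothetical loadings (p = 5, m = 2) and specific variances; illustrative values only.
Lambda = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
    [0.1, 0.7],
])
psi = np.array([0.18, 0.32, 0.18, 0.32, 0.50])

# Implied covariance matrix, Sigma = Lambda Lambda' + Psi.
Sigma = Lambda @ Lambda.T + np.diag(psi)

# sigma_12 is the product of the first two rows of Lambda.
sigma_12 = Lambda[0] @ Lambda[1]

# Simulate y = Lambda f + eps with cov(f) = I, cov(eps) = Psi, cov(f, eps) = O.
rng = np.random.default_rng(0)
n = 100_000
f = rng.normal(size=(n, 2))
eps = rng.normal(size=(n, 5)) * np.sqrt(psi)
y = f @ Lambda.T + eps
S = np.cov(y, rowvar=False)
max_abs_error = np.abs(S - Sigma).max()
```

Here `sigma_12` equals the (1, 2) element of `Sigma`, and with a sample this large `S` differs from `Sigma` only by sampling error. Variables 1 and 2 have similar rows in `Lambda` and a large covariance; variables 1 and 3 have dissimilar rows and a small one.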

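We can also find the covariances of the $y$’s with the $f$’s in terms of the $\lambda$’s.
Consider, for example, $\operatorname{cov}(y_1, f_2)$. By (13.1), $y_1 - \mu_1 = \lambda_{11} f_1 + \lambda_{12} f_2 + \cdots
+ \lambda_{1m} f_m + \varepsilon_1$. From (13.7), $f_2$ is uncorrelated with all other $f_j$’s, and by (13.10),
$f_2$ is uncorrelated with $\varepsilon_1$. Thus

\[
\begin{aligned}
\operatorname{cov}(y_1, f_2) &= E[(y_1 - \mu_1)(f_2 - \mu_{f_2})] \\
&= E[(\lambda_{11} f_1 + \lambda_{12} f_2 + \cdots + \lambda_{1m} f_m) f_2] \\
&= E(\lambda_{11} f_1 f_2 + \lambda_{12} f_2^2 + \cdots + \lambda_{1m} f_m f_2) \\
&= \lambda_{11}\operatorname{cov}(f_1, f_2) + \lambda_{12}\operatorname{var}(f_2) + \cdots + \lambda_{1m}\operatorname{cov}(f_m, f_2) \\
&= \lambda_{12}
\end{aligned}
\]

since $\operatorname{var}(f_2) = 1$. Hence the loadings themselves represent covariances of the
variables with the factors. In general,

\[
\operatorname{cov}(y_i, f_j) = \lambda_{ij}, \quad i = 1, 2, \ldots, p, \quad j = 1, 2, \ldots, m. \tag{13.12}
\]

Since $\lambda_{ij}$ is the $(ij)$th element of $\boldsymbol{\Lambda}$, we can write (13.12) in the form

\[
\operatorname{cov}(\mathbf{y}, \mathbf{f}) = \boldsymbol{\Lambda}. \tag{13.13}
\]

If standardized variables are used, (13.11) is replaced by $\mathbf{P}_\rho = \boldsymbol{\Lambda}\boldsymbol{\Lambda}' + \boldsymbol{\Psi}$, and the
loadings become correlations:

\[
\operatorname{corr}(y_i, f_j) = \lambda_{ij}. \tag{13.14}
\]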

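In (13.2), we have a partitioning of the variance of $y_i$ into a component due to the
common factors, called the communality, and a component unique to $y_i$, called the
specific variance:

\[
\begin{aligned}
\sigma_{ii} = \operatorname{var}(y_i) &= (\lambda_{i1}^2 + \lambda_{i2}^2 + \cdots + \lambda_{im}^2) + \psi_i \\
&= h_i^2 + \psi_i \\
&= \text{communality} + \text{specific variance},
\end{aligned}
\]

where

\[
\text{Communality} = h_i^2 = \lambda_{i1}^2 + \lambda_{i2}^2 + \cdots + \lambda_{im}^2, \tag{13.15}
\]
\[
\text{Specific variance} = \psi_i.
\]

The communality $h_i^2$ is also referred to as common variance, and the specific variance
$\psi_i$ has been called specificity, unique variance, or residual variance.

Assumptions (13.6)-(13.10) lead to the simple covariance structure of (13.11),
$\boldsymbol{\Sigma} = \boldsymbol{\Lambda}\boldsymbol{\Lambda}' + \boldsymbol{\Psi}$, which is an essential part of the factor analysis model. In schematic
form, $\boldsymbol{\Sigma} = \boldsymbol{\Lambda}\boldsymbol{\Lambda}' + \boldsymbol{\Psi}$ expresses the $p \times p$ matrix $\boldsymbol{\Sigma}$ as the product of the $p \times m$
matrix $\boldsymbol{\Lambda}$ and its $m \times p$ transpose $\boldsymbol{\Lambda}'$, plus the $p \times p$ diagonal matrix $\boldsymbol{\Psi}$.

The diagonal elements of $\boldsymbol{\Sigma}$ can be easily modeled by adjusting the diagonal
elements of $\boldsymbol{\Psi}$, but $\boldsymbol{\Lambda}\boldsymbol{\Lambda}'$ is a simplified configuration for the off-diagonal elements.
Hence the critical aspect of the model involves the covariances, and this is the major
emphasis of factor analysis, as noted in Section 13.1 and in comments following
(13.2) and (13.10).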

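It is a rare population covariance matrix $\boldsymbol{\Sigma}$ that can be expressed exactly as $\boldsymbol{\Sigma} =
\boldsymbol{\Lambda}\boldsymbol{\Lambda}' + \boldsymbol{\Psi}$, where $\boldsymbol{\Psi}$ is diagonal and $\boldsymbol{\Lambda}$ is $p \times m$, with $m$ relatively small. In practice,
many sample covariance matrices do not come satisfactorily close to this ideal
pattern. However, we do not relax the assumptions because the structure $\boldsymbol{\Sigma} = \boldsymbol{\Lambda}\boldsymbol{\Lambda}' + \boldsymbol{\Psi}$
is essential for estimation of $\boldsymbol{\Lambda}$.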

One advantage of the factor analysis model is that when it does not fit the data,
the estimate of $\boldsymbol{\Lambda}$ clearly reflects this failure. In such cases, there are two problems in
the estimates: (1) it is unclear how many factors there should be, and (2) it is unclear
what  the  factors  are.  In  other  statistical  procedures,  failure  of  assumptions  may  not 
lead  to  such  obvious  consequences  in  the  estimates  or  tests.  In  factor  analysis,  the 
assumptions  are  essentially  self-checking,  whereas  in  other  procedures,  we  typically 
have  to  check  the  assumptions  with  residual  plots,  tests,  and  so  on. 

13.2.2  Nonuniqueness  of  Factor  Loadings 

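The loadings in the model (13.3) can be multiplied by an orthogonal matrix without
impairing their ability to reproduce the covariance matrix in $\boldsymbol{\Sigma} = \boldsymbol{\Lambda}\boldsymbol{\Lambda}' + \boldsymbol{\Psi}$. To see
this, let $\mathbf{T}$ be an arbitrary orthogonal matrix. Then by (2.102), $\mathbf{T}\mathbf{T}' = \mathbf{I}$, and we can
insert $\mathbf{T}\mathbf{T}'$ into the basic model (13.3) to obtain

\[
\mathbf{y} - \boldsymbol{\mu} = \boldsymbol{\Lambda}\mathbf{T}\mathbf{T}'\mathbf{f} + \boldsymbol{\varepsilon}.
\]

We then associate $\mathbf{T}$ with $\boldsymbol{\Lambda}$ and associate $\mathbf{T}'$ with $\mathbf{f}$ so that the model becomes

\[
\mathbf{y} - \boldsymbol{\mu} = \boldsymbol{\Lambda}^*\mathbf{f}^* + \boldsymbol{\varepsilon}, \tag{13.16}
\]

where

\[
\boldsymbol{\Lambda}^* = \boldsymbol{\Lambda}\mathbf{T}, \tag{13.17}
\]
\[
\mathbf{f}^* = \mathbf{T}'\mathbf{f}. \tag{13.18}
\]

If $\boldsymbol{\Lambda}$ in $\boldsymbol{\Sigma} = \boldsymbol{\Lambda}\boldsymbol{\Lambda}' + \boldsymbol{\Psi}$ is replaced by $\boldsymbol{\Lambda}^* = \boldsymbol{\Lambda}\mathbf{T}$, we have

\[
\begin{aligned}
\boldsymbol{\Sigma} = \boldsymbol{\Lambda}^*\boldsymbol{\Lambda}^{*\prime} + \boldsymbol{\Psi} &= \boldsymbol{\Lambda}\mathbf{T}(\boldsymbol{\Lambda}\mathbf{T})' + \boldsymbol{\Psi} \\
&= \boldsymbol{\Lambda}\mathbf{T}\mathbf{T}'\boldsymbol{\Lambda}' + \boldsymbol{\Psi} = \boldsymbol{\Lambda}\boldsymbol{\Lambda}' + \boldsymbol{\Psi}
\end{aligned}
\]

since $\mathbf{T}\mathbf{T}' = \mathbf{I}$. Thus the new loadings $\boldsymbol{\Lambda}^* = \boldsymbol{\Lambda}\mathbf{T}$ in (13.17) reproduce the covariance
matrix, just as $\boldsymbol{\Lambda}$ does in (13.11):

\[
\boldsymbol{\Sigma} = \boldsymbol{\Lambda}^*\boldsymbol{\Lambda}^{*\prime} + \boldsymbol{\Psi} = \boldsymbol{\Lambda}\boldsymbol{\Lambda}' + \boldsymbol{\Psi}. \tag{13.19}
\]

The new factors $\mathbf{f}^* = \mathbf{T}'\mathbf{f}$ in (13.18) satisfy the assumptions (13.6), (13.7), and
(13.10); that is, $E(\mathbf{f}^*) = \mathbf{0}$, $\operatorname{cov}(\mathbf{f}^*) = \mathbf{I}$, and $\operatorname{cov}(\mathbf{f}^*, \boldsymbol{\varepsilon}) = \mathbf{O}$.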

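The communalities $h_i^2 = \lambda_{i1}^2 + \lambda_{i2}^2 + \cdots + \lambda_{im}^2$, $i = 1, 2, \ldots, p$, as defined in
(13.15), are also unaffected by the transformation $\boldsymbol{\Lambda}^* = \boldsymbol{\Lambda}\mathbf{T}$. This can be seen as
follows. The communality $h_i^2$ is the sum of squares of the $i$th row of $\boldsymbol{\Lambda}$. If we denote
the $i$th row of $\boldsymbol{\Lambda}$ by $\boldsymbol{\lambda}_i'$, then the sum of squares in vector notation is $h_i^2 = \boldsymbol{\lambda}_i'\boldsymbol{\lambda}_i$. The
$i$th row of $\boldsymbol{\Lambda}^* = \boldsymbol{\Lambda}\mathbf{T}$ is $\boldsymbol{\lambda}_i^{*\prime} = \boldsymbol{\lambda}_i'\mathbf{T}$, and the corresponding communality is

\[
h_i^{*2} = \boldsymbol{\lambda}_i^{*\prime}\boldsymbol{\lambda}_i^* = \boldsymbol{\lambda}_i'\mathbf{T}\mathbf{T}'\boldsymbol{\lambda}_i = \boldsymbol{\lambda}_i'\boldsymbol{\lambda}_i = h_i^2.
\]

Thus the communalities remain the same for the new loadings. Note that $h_i^2 =
\lambda_{i1}^2 + \lambda_{i2}^2 + \cdots + \lambda_{im}^2 = \boldsymbol{\lambda}_i'\boldsymbol{\lambda}_i$ is the squared distance from the origin to the point $\boldsymbol{\lambda}_i' =
(\lambda_{i1}, \lambda_{i2}, \ldots, \lambda_{im})$ in the $m$-dimensional space of the factor loadings. Since the squared
distance $\boldsymbol{\lambda}_i'\boldsymbol{\lambda}_i$ is the same as $\boldsymbol{\lambda}_i^{*\prime}\boldsymbol{\lambda}_i^*$, the points $\boldsymbol{\lambda}_i^*$ are rotated from the points $\boldsymbol{\lambda}_i$. [This
also follows because $\boldsymbol{\lambda}_i^{*\prime} = \boldsymbol{\lambda}_i'\mathbf{T}$, where $\mathbf{T}$ is orthogonal. Multiplication of a vector
by an orthogonal matrix is equivalent to a rotation of axes; see (2.103).]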

The  inherent  potential  to  rotate  the  loadings  to  a  new  frame  of  reference  without 
affecting  any  assumptions  or  properties  is  very  useful  in  interpretation  of  the  factors 
and  will  be  exploited  in  Section  13.5. 
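
The invariance in (13.19) and the unchanged communalities can be checked numerically. The following is a minimal NumPy sketch under stated assumptions: the loadings, the specific variances, and the orthogonal matrix are randomly generated for illustration and do not come from any data set in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
p, m = 5, 2

# Hypothetical loadings (p x m) and specific variances, for illustration only.
Lambda = rng.normal(size=(p, m))
Psi = np.diag(rng.uniform(0.1, 0.5, size=p))

# A random orthogonal T, taken from the QR decomposition of a random matrix.
T, _ = np.linalg.qr(rng.normal(size=(m, m)))

Lambda_star = Lambda @ T                       # rotated loadings, as in (13.17)

Sigma = Lambda @ Lambda.T + Psi                # covariance from original loadings
Sigma_star = Lambda_star @ Lambda_star.T + Psi # covariance from rotated loadings

h2 = (Lambda**2).sum(axis=1)                   # communalities: row sums of squares
h2_star = (Lambda_star**2).sum(axis=1)

print(np.allclose(T @ T.T, np.eye(m)))         # T is orthogonal
print(np.allclose(Sigma, Sigma_star))          # (13.19) holds
print(np.allclose(h2, h2_star))                # communalities unchanged
```

Any orthogonal $T$ would serve here; the QR route is only a convenient way to generate one.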

Note  that  the  coefficients  (loadings)  in  (13.1)  are  applied  to  the  factors,  not  to  the 
variables,  as  they  are  in  discriminant  functions  and  principal  components.  Thus  in 
factor  analysis,  the  observed  variables  are  not  involved  in  the  rotation,  as  they  are  in 
discriminant  functions  and  principal  components. 


13.3  ESTIMATION  OF  LOADINGS  AND  COMMUNALITIES 

In Sections 13.3.1-13.3.4, we discuss four approaches to estimation of the loadings
and communalities.

13.3.1  Principal  Component  Method 

The  first  technique  we  consider  is  commonly  called  the  principal  component  method. 
This name is perhaps unfortunate in that it adds to the confusion between factor
analysis and principal component analysis. In the principal component method for
estimation  of  loadings,  we  do  not  actually  calculate  any  principal  components.  The 
reason  for  the  name  is  given  following  (13.25). 

From a random sample $y_1, y_2, \ldots, y_n$, we obtain the sample covariance matrix
$S$ and then attempt to find an estimator $\hat\Lambda$ that will approximate the fundamental
expression (13.11) with $S$ in place of $\Sigma$:

$$S \cong \hat\Lambda\hat\Lambda' + \hat\Psi. \tag{13.20}$$

In the principal component approach, we neglect $\hat\Psi$ and factor $S$ into $S = \hat\Lambda\hat\Lambda'$.

In order to factor $S$, we use the spectral decomposition in (2.109),

$$S = CDC', \tag{13.21}$$

where $C$ is an orthogonal matrix constructed with normalized eigenvectors ($c_i'c_i =
1$) of $S$ as columns and $D$ is a diagonal matrix with the eigenvalues $\theta_1, \theta_2, \ldots, \theta_p$ of
$S$ on the diagonal:

$$D = \begin{pmatrix}
\theta_1 & 0 & \cdots & 0 \\
0 & \theta_2 & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & \theta_p
\end{pmatrix}. \tag{13.22}$$

We use the notation $\theta_j$ for eigenvalues instead of the usual $\lambda_j$ in order to avoid
confusion with the notation $\lambda_{ij}$ used for the loadings.

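To finish factoring $CDC'$ in (13.21) into the form $\hat\Lambda\hat\Lambda'$, we observe that since the
eigenvalues $\theta_j$ of the positive semidefinite matrix $S$ are all positive or zero, we can
factor $D$ into

$$D = D^{1/2}D^{1/2},$$

where

$$D^{1/2} = \begin{pmatrix}
\sqrt{\theta_1} & 0 & \cdots & 0 \\
0 & \sqrt{\theta_2} & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & \sqrt{\theta_p}
\end{pmatrix}.$$

With this factoring of $D$, (13.21) becomes

$$S = CDC' = CD^{1/2}D^{1/2}C' = (CD^{1/2})(CD^{1/2})'. \tag{13.23}$$

This is of the form $S = \hat\Lambda\hat\Lambda'$, but we do not define $\hat\Lambda$ to be $CD^{1/2}$ because $CD^{1/2}$
is $p \times p$, and we are seeking a $\hat\Lambda$ that is $p \times m$ with $m < p$. We therefore define
$D_1 = \mathrm{diag}(\theta_1, \theta_2, \ldots, \theta_m)$ with the $m$ largest eigenvalues $\theta_1 > \theta_2 > \cdots > \theta_m$ and
$C_1 = (c_1, c_2, \ldots, c_m)$ containing the corresponding eigenvectors. We then estimate
$\Lambda$ by the first $m$ columns of $CD^{1/2}$,

$$\hat\Lambda = C_1D_1^{1/2} = (\sqrt{\theta_1}\,c_1, \sqrt{\theta_2}\,c_2, \ldots, \sqrt{\theta_m}\,c_m) \tag{13.24}$$

[see (2.56)], where $\hat\Lambda$ is $p \times m$, $C_1$ is $p \times m$, and $D_1^{1/2}$ is $m \times m$.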

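We illustrate the structure of the $\hat\lambda_{ij}$'s in (13.24) for $p = 5$ and $m = 2$:

$$\begin{pmatrix}
\hat\lambda_{11} & \hat\lambda_{12} \\
\hat\lambda_{21} & \hat\lambda_{22} \\
\hat\lambda_{31} & \hat\lambda_{32} \\
\hat\lambda_{41} & \hat\lambda_{42} \\
\hat\lambda_{51} & \hat\lambda_{52}
\end{pmatrix}
=
\begin{pmatrix}
c_{11} & c_{12} \\
c_{21} & c_{22} \\
c_{31} & c_{32} \\
c_{41} & c_{42} \\
c_{51} & c_{52}
\end{pmatrix}
\begin{pmatrix}
\sqrt{\theta_1} & 0 \\
0 & \sqrt{\theta_2}
\end{pmatrix}
=
\begin{pmatrix}
\sqrt{\theta_1}\,c_{11} & \sqrt{\theta_2}\,c_{12} \\
\sqrt{\theta_1}\,c_{21} & \sqrt{\theta_2}\,c_{22} \\
\sqrt{\theta_1}\,c_{31} & \sqrt{\theta_2}\,c_{32} \\
\sqrt{\theta_1}\,c_{41} & \sqrt{\theta_2}\,c_{42} \\
\sqrt{\theta_1}\,c_{51} & \sqrt{\theta_2}\,c_{52}
\end{pmatrix}
\quad [\text{by (2.56)}]. \tag{13.25}$$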

We can see in (13.25) the source of the term principal component solution. The
columns of $\hat\Lambda$ are proportional to the eigenvectors of $S$, so that the loadings on the $j$th
factor are proportional to coefficients in the $j$th principal component. The factors are
thus  related  to  the  first  m  principal  components,  and  it  would  seem  that  interpretation 
would  be  the  same  as  for  principal  components.  But  after  rotation  of  the  loadings,  the 
interpretation  of  the  factors  is  usually  different.  The  researcher  will  ordinarily  prefer 
the  rotated  factors  for  reasons  to  be  treated  in  Section  13.5. 
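
The construction in (13.21) through (13.24) can be sketched in a few lines of NumPy. This is an illustrative sketch, not a reference implementation: the function name `pc_loadings` and the small covariance matrix are assumptions made for the example. Because eigenvector signs are arbitrary, the sketch fixes them by making each column sum nonnegative.

```python
import numpy as np

def pc_loadings(S, m):
    """Principal component estimates of the loadings, following (13.24).

    Returns the p x m loading matrix and all p eigenvalues (largest first).
    """
    theta, C = np.linalg.eigh(S)              # eigh returns ascending eigenvalues
    theta, C = theta[::-1], C[:, ::-1]        # reorder to descending
    # Eigenvector signs are arbitrary; make each column sum nonnegative.
    C = C * np.where(C.sum(axis=0) < 0, -1.0, 1.0)
    return C[:, :m] * np.sqrt(theta[:m]), theta

# Hypothetical 3 x 3 positive definite covariance matrix, for illustration only.
S = np.array([[4.0, 2.0, 0.6],
              [2.0, 3.0, 0.4],
              [0.6, 0.4, 1.0]])

L_full, theta = pc_loadings(S, 3)   # m = p: S is reproduced exactly, as in (13.23)
L_two, _ = pc_loadings(S, 2)        # m < p: keep the two largest eigenvalues

print(np.allclose(L_full @ L_full.T, S))
print(np.round(L_two, 3))
```

With $m = p$ the product of the loading matrix and its transpose returns $S$; with $m < p$ it is only an approximation, which is why the specific variances are introduced next.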

By (2.52), the $i$th diagonal element of $\hat\Lambda\hat\Lambda'$ is the sum of squares of the $i$th row of
$\hat\Lambda$, or $\hat\lambda_i'\hat\lambda_i = \sum_{j=1}^m \hat\lambda_{ij}^2$. Hence to complete the approximation of $S$ in (13.20), we
define

$$\hat\psi_i = s_{ii} - \sum_{j=1}^m \hat\lambda_{ij}^2 \tag{13.26}$$

and write

$$S \cong \hat\Lambda\hat\Lambda' + \hat\Psi, \tag{13.27}$$

where $\hat\Psi = \mathrm{diag}(\hat\psi_1, \hat\psi_2, \ldots, \hat\psi_p)$. Thus in (13.27) the variances on the diagonal of
$S$ are modeled exactly, but the off-diagonal covariances are only approximate. Again,
this is the challenge of factor analysis.

In this method of estimation, the sums of squares of the rows and columns of $\hat\Lambda$
are equal to communalities and eigenvalues, respectively. This is easily shown. By
(13.26) and by analogy with (13.15), the $i$th communality is estimated by

$$\hat h_i^2 = \sum_{j=1}^m \hat\lambda_{ij}^2, \tag{13.28}$$

which is the sum of squares of the $i$th row of $\hat\Lambda$. The sum of squares of the $j$th
column of $\hat\Lambda$ is the $j$th eigenvalue of $S$:

$$\sum_{i=1}^p \hat\lambda_{ij}^2 = \sum_{i=1}^p (\sqrt{\theta_j}\,c_{ij})^2 \quad [\text{by (13.25)}]$$

$$= \theta_j \sum_{i=1}^p c_{ij}^2 = \theta_j, \tag{13.29}$$

since the normalized eigenvectors (columns of $C$) have length 1.

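By (13.26) and (13.28), the variance of the $i$th variable is partitioned into a part
due to the factors and a part due uniquely to the variable:

$$s_{ii} = \hat h_i^2 + \hat\psi_i = \hat\lambda_{i1}^2 + \hat\lambda_{i2}^2 + \cdots + \hat\lambda_{im}^2 + \hat\psi_i. \tag{13.30}$$

Thus the $j$th factor contributes $\hat\lambda_{ij}^2$ to $s_{ii}$. The contribution of the $j$th factor to the
total sample variance, $\mathrm{tr}(S) = s_{11} + s_{22} + \cdots + s_{pp}$, is, therefore,

$$\text{Variance due to $j$th factor} = \sum_{i=1}^p \hat\lambda_{ij}^2 = \hat\lambda_{1j}^2 + \hat\lambda_{2j}^2 + \cdots + \hat\lambda_{pj}^2, \tag{13.31}$$

which is the sum of squares of loadings in the $j$th column of $\hat\Lambda$. By (13.29), this is
equal to the $j$th eigenvalue, $\theta_j$. The proportion of total sample variance due to the
$j$th factor is, therefore,

$$\frac{\sum_{i=1}^p \hat\lambda_{ij}^2}{\mathrm{tr}(S)} = \frac{\theta_j}{\mathrm{tr}(S)}. \tag{13.32}$$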


If  the  variables  are  not  commensurate,  we  can  use  standardized  variables  and  work 
with  the  correlation  matrix  R.  The  eigenvalues  and  eigenvectors  of  R  are  then  used 
in  place  of  those  of  S  in  (13.24)  to  obtain  estimates  of  the  loadings.  In  practice,  R 
is  used  more  often  than  S  and  is  the  default  in  most  software  packages.  Since  the 
emphasis in factor analysis is on reproducing the covariances or correlations rather
than the variances, use of R is more appropriate in factor analysis than in principal
components.  In  applications,  R  often  gives  better  results  than  S. 

If we are factoring $R$, the proportion corresponding to (13.32) is

$$\frac{\sum_{i=1}^p \hat\lambda_{ij}^2}{\mathrm{tr}(R)} = \frac{\theta_j}{p}, \tag{13.33}$$

where $p$ is the number of variables.

We can assess the fit of the factor analysis model by comparing the left and right
sides of (13.27). The error matrix

$$E = S - (\hat\Lambda\hat\Lambda' + \hat\Psi)$$

has zeros on the diagonal but nonzero off-diagonal elements. The following inequality
gives a bound on the size of the elements in $E$:

$$\sum_{ij} e_{ij}^2 \le \theta_{m+1}^2 + \theta_{m+2}^2 + \cdots + \theta_p^2; \tag{13.34}$$

that is, the sum of squared entries in the matrix $E = S - (\hat\Lambda\hat\Lambda' + \hat\Psi)$ is at most equal
to the sum of squares of the deleted eigenvalues of $S$. If the deleted eigenvalues are small,
the residuals in the error matrix $S - (\hat\Lambda\hat\Lambda' + \hat\Psi)$ are small and the fit is good.
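
The bound in (13.34) can be illustrated numerically. The sketch below assumes a small hypothetical covariance matrix and a single retained factor; it builds the principal component loadings, fills in the specific variances from (13.26), and compares the sum of squared residuals with the sum of squares of the deleted eigenvalues.

```python
import numpy as np

# Hypothetical covariance matrix, for illustration only.
S = np.array([[4.0, 2.0, 0.6],
              [2.0, 3.0, 0.4],
              [0.6, 0.4, 1.0]])
m = 1

theta, C = np.linalg.eigh(S)
theta, C = theta[::-1], C[:, ::-1]                  # largest eigenvalue first
Lam = C[:, :m] * np.sqrt(theta[:m])                 # loadings, (13.24)
Psi = np.diag(np.diag(S) - (Lam**2).sum(axis=1))    # specific variances, (13.26)

E = S - (Lam @ Lam.T + Psi)                         # error matrix
sum_sq_E = (E**2).sum()                             # left side of (13.34)
bound = (theta[m:]**2).sum()                        # right side of (13.34)

print(np.round(E, 4))
print(sum_sq_E <= bound)
```

The diagonal of `E` is zero by construction, so only the off-diagonal covariances contribute to the left side.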


Example  13.3.1.  To  illustrate  the  principal  component  method  of  estimation,  we  use 
a  simple  data  set  collected  by  Brown,  Williams,  and  Barlow  (1984).  A  12-year-old 
girl  made  five  ratings  on  a  9-point  semantic  differential  scale  for  each  of  seven  of  her 
acquaintances.  The  ratings  were  based  on  the  five  adjectives  kind,  intelligent,  happy, 
likeable,  and  just.  Her  ratings  are  given  in  Table  13.1. 


Table 13.1. Perception Data: Ratings on Five Adjectives for Seven People

People      Kind   Intelligent   Happy   Likeable   Just
FSM1(a)       1         5          5        1         1
SISTER        8         9          7        9         8
FSM2          9         8          9        9         8
FATHER        9         9          9        9         9
TEACHER       1         9          1        1         9
MSM(b)        9         7          7        9         9
FSM3          9         7          9        9         7

(a) Female schoolmate 1.
(b) Male schoolmate.

The correlation matrix for the five variables (adjectives) is as follows, with the
larger values bolded:

$$R = \begin{pmatrix}
1.000 & .296 & \mathbf{.881} & \mathbf{.995} & .545 \\
.296 & 1.000 & -.022 & .326 & \mathbf{.837} \\
\mathbf{.881} & -.022 & 1.000 & \mathbf{.867} & .130 \\
\mathbf{.995} & .326 & \mathbf{.867} & 1.000 & .544 \\
.545 & \mathbf{.837} & .130 & .544 & 1.000
\end{pmatrix}. \tag{13.35}$$


The  boldface  values  indicate  two  groups  of  variables:  {1,3,4}  and  {2,  5}.  We  would 
therefore  expect  that  the  correlations  among  the  variables  can  be  explained  fairly 
well  by  two  factors. 

The eigenvalues of R are 3.263, 1.538, .168, .031, and 0. Thus R is singular, which
is  possible  in  a  situation  such  as  this  with  only  seven  observations  on  five  variables 
recorded  in  a  single-digit  scale.  The  multicollinearity  among  the  variables  induced 
by  the  fifth  eigenvalue,  0,  could  be  ascertained  from  the  corresponding  eigenvector, 
as  noted  in  Section  12.7  (see  Problem  13.6). 
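
The singularity of R can be checked directly from the ratings in Table 13.1. The following NumPy sketch (variable names are illustrative) recomputes the correlation matrix and its eigenvalues from the raw ratings.

```python
import numpy as np

# Ratings from Table 13.1; columns: kind, intelligent, happy, likeable, just.
Y = np.array([
    [1, 5, 5, 1, 1],   # FSM1
    [8, 9, 7, 9, 8],   # SISTER
    [9, 8, 9, 9, 8],   # FSM2
    [9, 9, 9, 9, 9],   # FATHER
    [1, 9, 1, 1, 9],   # TEACHER
    [9, 7, 7, 9, 9],   # MSM
    [9, 7, 9, 9, 7],   # FSM3
], dtype=float)

R = np.corrcoef(Y, rowvar=False)        # 5 x 5 correlation matrix
theta = np.linalg.eigvalsh(R)[::-1]     # eigenvalues, largest first

print(np.round(R, 3))
print(np.round(theta, 3))

# One exact linear dependence among the five ratings:
print(Y @ np.array([3, 1, -1, -2, -1]))
```

For these seven rows, 3(kind) + intelligent - happy - 2(likeable) - just equals zero for every person, which is why the fifth eigenvalue vanishes up to floating-point rounding.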

By (13.33), the first two factors account for (3.263 + 1.538)/5 = .96 of the total
sample variance. We therefore extract two factors. The first two eigenvectors are

$$c_1 = \begin{pmatrix} .537 \\ .288 \\ .434 \\ .537 \\ .390 \end{pmatrix}
\quad \text{and} \quad
c_2 = \begin{pmatrix} -.186 \\ .651 \\ -.473 \\ -.169 \\ .538 \end{pmatrix}.$$

When these are multiplied by the square roots of the respective eigenvalues 3.263
and 1.538 as in (13.25), we obtain the loadings in Table 13.2.

Table 13.2. Factor Loadings by the Principal Component Method for the Perception
Data of Table 13.1

                                   Loadings
Variables                      $\hat\lambda_{i1}$   $\hat\lambda_{i2}$   Communalities, $\hat h_i^2$   Specific Variances, $\hat\psi_i$
Kind                             .969    -.231          .993                  .007
Intelligent                      .519     .807          .921                  .079
Happy                            .785    -.587          .960                  .040
Likeable                         .971    -.210          .987                  .013
Just                             .704     .667          .940                  .060

Variance accounted for          3.263    1.538         4.802
Proportion of total variance     .653     .308          .960
Cumulative proportion            .653     .960          .960

The communalities in Table 13.2 are obtained from the sum of squares of the rows
of the loadings, as in (13.28). The first one, for example, is $(.969)^2 + (-.231)^2 =
.993$. The specific variances are obtained from (13.26) as $\hat\psi_i = 1 - \hat h_i^2$ using 1 in
place of $s_{ii}$ because we are factoring R rather than S. The variance accounted for by
each factor is the sum of squares of the corresponding column of the loadings, as
in (13.31). By (13.29), the variance accounted for is also equal to the eigenvalue in
each case. Notice that the variance accounted for by the two factors adds to the sum
of the communalities, since the latter is the sum of all squared loadings. By (13.33),
the proportion of total variance for each factor is the variance accounted for divided
by 5.
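
The computations behind Table 13.2 can be retraced from the correlation matrix printed in this example. The sketch below is illustrative only: it starts from R rounded to three decimals, so the results agree with the table only to about two decimals, and it fixes the arbitrary eigenvector signs by making each column sum nonnegative.

```python
import numpy as np

# Correlation matrix for the perception data, as printed (three decimals).
R = np.array([
    [1.000,  .296,  .881,  .995,  .545],
    [ .296, 1.000, -.022,  .326,  .837],
    [ .881, -.022, 1.000,  .867,  .130],
    [ .995,  .326,  .867, 1.000,  .544],
    [ .545,  .837,  .130,  .544, 1.000],
])
m = 2

theta, C = np.linalg.eigh(R)
theta, C = theta[::-1], C[:, ::-1]                 # largest eigenvalue first
C = C * np.where(C.sum(axis=0) < 0, -1.0, 1.0)     # fix arbitrary signs

Lam = C[:, :m] * np.sqrt(theta[:m])                # loadings, (13.24)
h2 = (Lam**2).sum(axis=1)                          # communalities, (13.28)
psi = 1.0 - h2                                     # specific variances, (13.26)
prop = theta[:m] / R.shape[0]                      # proportions, (13.33)

print(np.round(Lam, 3))
print(np.round(h2, 3), np.round(psi, 3))
print(np.round(prop, 3), np.round(prop.cumsum(), 3))
```

Because the printed R is rounded, its smallest eigenvalue may come out slightly different from zero; only the two largest eigenvalues are used for the loadings.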

The  two  factors  account  for  96%  of  the  total  variance  and  therefore  represent 
the  five  variables  very  well.  To  see  how  well  the  two-factor  model  reproduces  the 
correlation  matrix,  we  examine 


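$$\hat\Lambda\hat\Lambda' + \hat\Psi =
\begin{pmatrix}
.969 & -.231 \\
.519 & .807 \\
.785 & -.587 \\
.971 & -.210 \\
.704 & .667
\end{pmatrix}
\begin{pmatrix}
.969 & .519 & .785 & .971 & .704 \\
-.231 & .807 & -.587 & -.210 & .667
\end{pmatrix}
+
\begin{pmatrix}
.007 & 0 & 0 & 0 & 0 \\
0 & .079 & 0 & 0 & 0 \\
0 & 0 & .040 & 0 & 0 \\
0 & 0 & 0 & .013 & 0 \\
0 & 0 & 0 & 0 & .060
\end{pmatrix}$$

$$= \begin{pmatrix}
1.000 & .317 & .896 & .990 & .528 \\
.317 & 1.000 & -.066 & .335 & .904 \\
.896 & -.066 & 1.000 & .885 & .161 \\
.990 & .335 & .885 & 1.000 & .543 \\
.528 & .904 & .161 & .543 & 1.000
\end{pmatrix},$$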


which  is  very  close  to  the  original  R.  We  will  not  attempt  to  interpret  the  factors  at 
this  point  but  will  wait  until  they  have  been  rotated  in  Section  13.5.2.  □ 


13.3.2  Principal  Factor  Method 

In the principal component approach to estimation of the loadings, we neglected
$\Psi$ and factored S or R. The principal factor method (also called the principal axis
method) uses an initial estimate $\hat\Psi$ and factors $S - \hat\Psi$ or $R - \hat\Psi$ to obtain

$$S - \hat\Psi \cong \hat\Lambda\hat\Lambda', \tag{13.36}$$

$$R - \hat\Psi \cong \hat\Lambda\hat\Lambda', \tag{13.37}$$

where $\hat\Lambda$ is $p \times m$ and is calculated as in (13.24) using eigenvalues and eigenvectors
of $S - \hat\Psi$ or $R - \hat\Psi$.

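The $i$th diagonal element of $S - \hat\Psi$ is given by $s_{ii} - \hat\psi_i$, which is the $i$th
communality, $\hat h_i^2 = s_{ii} - \hat\psi_i$ [see (13.30)]. Likewise, the diagonal elements of $R - \hat\Psi$ are the
communalities $\hat h_i^2 = 1 - \hat\psi_i$. (Clearly, $\hat\psi_i$ and $\hat h_i^2$ have different values for S than for
R.) With these diagonal values, $S - \hat\Psi$ and $R - \hat\Psi$ have the form

$$S - \hat\Psi = \begin{pmatrix}
\hat h_1^2 & s_{12} & \cdots & s_{1p} \\
s_{21} & \hat h_2^2 & \cdots & s_{2p} \\
\vdots & \vdots & & \vdots \\
s_{p1} & s_{p2} & \cdots & \hat h_p^2
\end{pmatrix}, \tag{13.38}$$

$$R - \hat\Psi = \begin{pmatrix}
\hat h_1^2 & r_{12} & \cdots & r_{1p} \\
r_{21} & \hat h_2^2 & \cdots & r_{2p} \\
\vdots & \vdots & & \vdots \\
r_{p1} & r_{p2} & \cdots & \hat h_p^2
\end{pmatrix}. \tag{13.39}$$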

A popular initial estimate for a communality in $R - \hat\Psi$ is $\hat h_i^2 = R_i^2$, the squared
multiple correlation between $y_i$ and the other $p - 1$ variables. This can be found as

$$\hat h_i^2 = R_i^2 = 1 - \frac{1}{r^{ii}}, \tag{13.40}$$

where $r^{ii}$ is the $i$th diagonal element of $R^{-1}$.

For $S - \hat\Psi$, an initial estimate of communality analogous to (13.40) is

$$\hat h_i^2 = s_{ii} - \frac{1}{s^{ii}}, \tag{13.41}$$

where $s_{ii}$ is the $i$th diagonal element of $S$ and $s^{ii}$ is the $i$th diagonal element of $S^{-1}$.
It can be shown that (13.41) is equivalent to

$$\hat h_i^2 = s_{ii} - \frac{1}{s^{ii}} = s_{ii}R_i^2, \tag{13.42}$$

which is a reasonable estimate of the amount of variance that $y_i$ has in common with
the other $y$'s.

To use (13.40) or (13.41), R or S must be nonsingular. If R is singular, we can use
the absolute value or the square of the largest correlation in the $i$th row of R as an
estimate of communality.
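
The two choices of initial communality described above can be sketched as follows. The function name `initial_communalities`, the tolerance `tol` used to decide whether R is numerically singular, and both small matrices are assumptions made for illustration.

```python
import numpy as np

def initial_communalities(R, tol=1e-8):
    """Initial communality estimates for factoring R - Psi.

    Uses squared multiple correlations (13.40) when R is nonsingular;
    otherwise falls back to the largest absolute correlation in each row.
    `tol` is an illustrative threshold on the smallest eigenvalue.
    """
    R = np.asarray(R, dtype=float)
    if np.linalg.eigvalsh(R).min() > tol:
        return 1.0 - 1.0 / np.diag(np.linalg.inv(R))
    off_diagonal = np.abs(R - np.diag(np.diag(R)))
    return off_diagonal.max(axis=1)

# Hypothetical nonsingular correlation matrix.
R_ok = np.array([[1.0, 0.5, 0.3],
                 [0.5, 1.0, 0.4],
                 [0.3, 0.4, 1.0]])

# Hypothetical singular correlation matrix (variables 1 and 2 coincide).
R_sing = np.array([[1.0, 1.0, 0.5],
                   [1.0, 1.0, 0.5],
                   [0.5, 0.5, 1.0]])

print(np.round(initial_communalities(R_ok), 4))
print(initial_communalities(R_sing))
```

A correlation matrix that is singular in exact arithmetic but has been rounded for printing may pass the eigenvalue check, so in practice the fallback may need to be requested explicitly.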

After obtaining communality estimates, we calculate eigenvalues and eigenvectors
of $S - \hat\Psi$ or $R - \hat\Psi$ and use (13.24) to obtain estimates of factor loadings, $\hat\Lambda$.
Then the columns and rows of $\hat\Lambda$ can be used to obtain new eigenvalues (variance
explained) and communalities, respectively. The sum of squares of the $j$th column
of $\hat\Lambda$ is the $j$th eigenvalue of $S - \hat\Psi$ or $R - \hat\Psi$, and the sum of squares of the $i$th
row of $\hat\Lambda$ is the communality of $y_i$. The proportion of variance explained by the $j$th
factor is

$$\frac{\theta_j}{\mathrm{tr}(S - \hat\Psi)} = \frac{\theta_j}{\sum_{i=1}^p \theta_i}$$

or

$$\frac{\theta_j}{\mathrm{tr}(R - \hat\Psi)} = \frac{\theta_j}{\sum_{i=1}^p \theta_i},$$

where $\theta_j$ is the $j$th eigenvalue of $S - \hat\Psi$ or $R - \hat\Psi$. The matrices $S - \hat\Psi$ and $R - \hat\Psi$
are not necessarily positive semidefinite and will often have some small negative
eigenvalues. In such a case, the cumulative proportion of variance will exceed 1 and
then decline to 1 as the negative eigenvalues are added. [Note that loadings cannot
be obtained by (13.24) for the negative eigenvalues.]


Example 13.3.2. To illustrate the principal factor method, we use the perception
data from Table 13.1. The correlation matrix as given in Example 13.3.1 is singular.
Hence in place of multiple correlations as communality estimates, we use (the
absolute value of) the largest correlation in each row of R. [The multiple correlation
of y with several variables is at least as great as the simple correlation of y with any
of the individual variables; see, for example, Rencher (2000, p. 240).] The diagonal
elements of $R - \hat\Psi$ as given by (13.39) are, therefore, .995, .837, .881, .995, and
.837, which are obtained from R in (13.35). The eigenvalues of $R - \hat\Psi$ are 3.202,
1.395, .030, -.0002, and -.080, whose sum is 4.546. The first two eigenvectors of
$R - \hat\Psi$ are

$$c_1 = \begin{pmatrix} .548 \\ .272 \\ .431 \\ .549 \\ .373 \end{pmatrix}
\quad \text{and} \quad
c_2 = \begin{pmatrix} -.178 \\ .656 \\ -.460 \\ -.159 \\ .549 \end{pmatrix}.$$

When  these  are  multiplied  by  the  square  roots  of  the  respective  eigenvalues,  we 
obtain  the  principal  factor  loadings.  In  Table  13.3,  these  are  compared  with  the  load¬ 
ings  obtained  by  the  principal  component  method  in  Example  13.3.1.  The  two  sets 
of  loadings  are  very  similar,  as  we  would  have  expected  because  of  the  large  size 
of  the  communalities.  The  communalities  in  Table  13.3  are  for  the  principal  factor 
loadings,  as  noted  above.  The  proportion  of  variance  in  each  case  for  the  principal 
factor  loadings  is  obtained  by  dividing  the  variance  accounted  for  (eigenvalue)  by 
the sum of the eigenvalues, 4.546; for example, 3.202/4.546 = .704.  □

Table 13.3. Loadings Obtained by Two Different Methods for Perception Data of
Table 13.1

                                Principal Component     Principal Factor
                                     Loadings               Loadings
Variables                         $f_1$       $f_2$        $f_1$       $f_2$     Communalities
Kind                              .969     -.231         .981     -.210        .995
Intelligent                       .519      .807         .487      .774        .837
Happy                             .785     -.587         .771     -.544        .881
Likeable                          .971     -.210         .982     -.188        .995
Just                              .704      .667         .667      .648        .837

Variance accounted for           3.263     1.538        3.202     1.395
Proportion of total variance      .653      .308         .704      .307
Cumulative proportion             .653      .960         .704     1.01

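
The principal factor columns of Table 13.3 can be retraced with a short NumPy sketch. It assumes the correlation matrix as printed in (13.35), rounded to three decimals, so agreement with the table is only approximate; eigenvector signs are fixed by making each column sum nonnegative.

```python
import numpy as np

# Correlation matrix for the perception data, as printed (three decimals).
R = np.array([
    [1.000,  .296,  .881,  .995,  .545],
    [ .296, 1.000, -.022,  .326,  .837],
    [ .881, -.022, 1.000,  .867,  .130],
    [ .995,  .326,  .867, 1.000,  .544],
    [ .545,  .837,  .130,  .544, 1.000],
])

# Initial communalities: largest absolute off-diagonal correlation per row.
off_diagonal = np.abs(R - np.eye(5))
Rr = R.copy()
np.fill_diagonal(Rr, off_diagonal.max(axis=1))      # R - Psi, as in (13.39)

theta, C = np.linalg.eigh(Rr)
theta, C = theta[::-1], C[:, ::-1]                  # largest eigenvalue first
C = C * np.where(C.sum(axis=0) < 0, -1.0, 1.0)      # fix arbitrary signs

m = 2
Lam = C[:, :m] * np.sqrt(theta[:m])                 # principal factor loadings
cum_prop = np.cumsum(theta) / theta.sum()           # cumulative proportions

print(np.round(theta, 3))
print(np.round(Lam, 3))
print(np.round(cum_prop, 2))
```

The negative eigenvalues of the reduced matrix are what push the cumulative proportion above 1 after two factors.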

13.3.3  Iterated  Principal  Factor  Method 

The principal factor method can easily be iterated to improve the estimates of
communality. After obtaining $\hat\Lambda$ from $S - \hat\Psi$ or $R - \hat\Psi$ in (13.36) or (13.37) using initial
communality estimates, we can obtain new communality estimates from the loadings
in $\hat\Lambda$ using (13.28),

$$\hat h_i^2 = \sum_{j=1}^m \hat\lambda_{ij}^2.$$

These values of $\hat h_i^2$ are substituted into the diagonal of $S - \hat\Psi$ or $R - \hat\Psi$, from
which we obtain a new value of $\hat\Lambda$ using (13.24). This process is continued until the
communality estimates converge. (For some data sets, the iterative procedure does
not converge.) Then the eigenvalues and eigenvectors of the final version of $S - \hat\Psi$
or $R - \hat\Psi$ are used in (13.24) to obtain the loadings.

The  principal  factor  method  and  iterated  principal  factor  method  will  typically 
yield  results  very  close  to  those  from  the  principal  component  method  when  either 
of  the  following  is  true. 


1.  The  correlations  are  fairly  large,  with  a  resulting  small  value  of  m. 

2.  The  number  of  variables,  p,  is  large. 


A shortcoming of the iterative approach is that sometimes it leads to a communality
estimate $\hat h_i^2$ exceeding 1 (when factoring R). Such a result is known as a Heywood
case (Heywood 1931). If $\hat h_i^2 > 1$, then $\hat\psi_i < 0$ by (13.26) and (13.28), which
is clearly improper, since we cannot have a negative specific variance. Thus when a
communality exceeds 1, the iterative process should stop, with the program reporting
that a solution cannot be reached. Some software programs have an option of
continuing the iterations by setting the communality equal to 1 in all subsequent iterations.
The resulting solution with $\hat\psi_i = 0$ is somewhat questionable because it implies
exact dependence of a variable on the factors, a possible but unlikely outcome.
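
The iteration, together with the Heywood check, might be organized as in the sketch below. The function name, the convergence tolerance, the iteration cap, and the one-factor test matrix are all illustrative assumptions; the sketch stops and reports failure when a communality exceeds 1 rather than resetting it to 1.

```python
import numpy as np

def iterated_principal_factor(R, m, h2_init, tol=1e-6, max_iter=500):
    """Iterate the principal factor method on R - Psi.

    Returns (loadings, communalities, converged). Stops with
    converged=False if a communality exceeds 1 (a Heywood case)
    or if max_iter is reached.
    """
    R = np.asarray(R, dtype=float)
    h2 = np.asarray(h2_init, dtype=float).copy()
    Lam = None
    for _ in range(max_iter):
        Rr = R.copy()
        np.fill_diagonal(Rr, h2)               # communalities on the diagonal
        theta, C = np.linalg.eigh(Rr)
        theta, C = theta[::-1][:m], C[:, ::-1][:, :m]
        if np.any(theta < 0):
            raise ValueError("negative eigenvalue among the first m")
        Lam = C * np.sqrt(theta)               # loadings, (13.24)
        h2_new = (Lam**2).sum(axis=1)          # new communalities, (13.28)
        if np.any(h2_new > 1.0):
            return Lam, h2_new, False          # Heywood case
        if np.max(np.abs(h2_new - h2)) < tol:
            return Lam, h2_new, True
        h2 = h2_new
    return Lam, h2, False

# Hypothetical correlation matrix that follows a one-factor model exactly.
lam_true = np.array([0.9, 0.8, 0.7, 0.6])
R = np.outer(lam_true, lam_true)
np.fill_diagonal(R, 1.0)

h2_start = 1.0 - 1.0 / np.diag(np.linalg.inv(R))   # squared multiple correlations
Lam, h2, converged = iterated_principal_factor(R, m=1, h2_init=h2_start)

print(converged)
print(np.round(h2, 4))
```

For this constructed matrix the iterated communalities approach the squared true loadings; real data need not behave so well.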

Example  13.3.3.  We  illustrate  the  iterated  principal  factor  method  using  the  Seishu 
data  in  Table  7.1.  The  correlation  matrix  is  as  follows: 


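$$R = \begin{pmatrix}
1.00 & .56 & .22 & .10 & .20 & -.04 & .13 & .03 & -.07 & .09 \\
.56 & 1.00 & -.09 & .13 & .20 & -.17 & .17 & .24 & .16 & .06 \\
.22 & -.09 & 1.00 & .16 & .70 & -.31 & -.45 & -.34 & -.11 & .68 \\
.10 & .13 & .16 & 1.00 & .49 & -.03 & -.16 & .01 & .42 & .37 \\
.20 & .20 & .70 & .49 & 1.00 & -.32 & -.34 & -.19 & .30 & .87 \\
-.04 & -.17 & -.31 & -.03 & -.32 & 1.00 & -.42 & -.57 & -.11 & -.26 \\
.13 & .17 & -.45 & -.16 & -.34 & -.42 & 1.00 & .82 & .23 & -.30 \\
.03 & .24 & -.34 & .01 & -.19 & -.57 & .82 & 1.00 & .45 & -.17 \\
-.07 & .16 & -.11 & .42 & .30 & -.11 & .23 & .45 & 1.00 & .29 \\
.09 & .06 & .68 & .37 & .87 & -.26 & -.30 & -.17 & .29 & 1.00
\end{pmatrix}$$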


The eigenvalues of R are 3.17, 2.56, 1.43, 1.28, .54, .47, .25, .12, .10, and .06. There
is a notable gap between 1.28 and .54, and we therefore extract four factors (see
Section 13.4). The first four eigenvalues account for a proportion

$$\frac{3.17 + 2.56 + 1.43 + 1.28}{10} = .844$$

of tr(R).

For initial communality estimates, we use the squared multiple correlation
between each variable and the other nine variables. These are given in Table 13.4,
along with the final communalities after iteration. We multiply the first four
eigenvectors of the final iterated version of $R - \hat\Psi$ by the square roots of the
respective eigenvalues, as in (13.24), to obtain the factor loadings given in Table 13.4.
We will not attempt to interpret the factors until after they have been rotated in
Example 13.5.2(b).  □

13.3.4  Maximum  Likelihood  Method 

Table 13.4. Iterated Principal Factor Loadings and Communalities for the Seishu Data

                                  Loadings                 Initial          Final
Variable               $f_1$      $f_2$      $f_3$      $f_4$    Communalities   Communalities
Taste                  .22     .31     .92     .12        .57            1.00
Odor                   .07     .40     .43    -.20        .54             .38
pH                     .80     .04     .05    -.40        .78             .79
Acidity 1              .41     .22    -.11     .37        .40             .36
Acidity 2              .94     .28    -.07     .05        .88             .98
Sake-meter            -.13    -.67     .10     .56        .77             .79
Reducing sugar        -.55     .66     .03    -.11        .79             .75
Total sugar           -.45     .88    -.14    -.07        .87             .99
Alcohol                .13     .54    -.37     .54        .66             .74
Formyl-nitrogen        .84     .21    -.17    -.02        .80             .78

Variance accounted for 3.00   2.37    1.25     .96       7.06            7.57

If we assume that the observations $y_1, y_2, \ldots, y_n$ constitute a random sample from
$N_p(\mu, \Sigma)$, then $\Lambda$ and $\Psi$ can be estimated by the method of maximum likelihood. It
can be shown that the estimates $\hat\Lambda$ and $\hat\Psi$ satisfy the following:

$$S\hat\Psi^{-1}\hat\Lambda = \hat\Lambda(I + \hat\Lambda'\hat\Psi^{-1}\hat\Lambda), \tag{13.43}$$

$$\hat\Psi = \mathrm{diag}(S - \hat\Lambda\hat\Lambda'), \tag{13.44}$$

$$\hat\Lambda'\hat\Psi^{-1}\hat\Lambda \text{ is diagonal.} \tag{13.45}$$

These  equations  must  be  solved  iteratively,  and  in  practice  the  procedure  may  fail  to 
converge  or  may  yield  a  Heywood  case  (Section  13.3.3). 

We  note  that  the  proportion  of  variance  accounted  for  by  the  factors,  as  given  by 
(13.32)  or  (13.33),  will  not  necessarily  be  in  descending  order  for  maximum  likeli¬ 
hood  factors,  as  it  is  for  factors  obtained  from  the  principal  component  or  principal 
factor  method. 

Example  13.3.4.  We  illustrate  the  maximum  likelihood  method  with  the  Seishu  data 
of  Table  7.1.  The  correlation  matrix  and  its  eigenvalues  were  given  in 
Example  13.3.3.  We  extract  four  factors,  as  in  Example  13.3.3.  The  iterative  solution 
of  (13.43),  (13.44),  and  (13.45)  yielded  the  loadings  and  communalities  given  in 
Table  13.5. 

The  pattern  of  the  loadings  is  different  from  that  obtained  using  the  iterated  prin¬ 
cipal  factor  method  in  Example  13.3.3,  but  we  will  not  compare  them  until  after 
rotation  in  Example  13.5.2(b).  Note  that  the  four  values  of  variance  accounted  for 
are  not  in  descending  order.  □ 


13.4  CHOOSING  THE  NUMBER  OF  FACTORS,  m 

Several  criteria  have  been  proposed  for  choosing  m,  the  number  of  factors.  We  con¬ 
sider  four  criteria,  which  are  similar  to  those  given  in  Section  12.6  for  choosing  the 
number  of  principal  components  to  retain. 



Table 13.5. Maximum Likelihood Loadings and Communalities for the Seishu Data

                                  Loadings
Variables              $f_1$      $f_2$      $f_3$      $f_4$    Communalities
Taste                 1.00     0       0       0         1.00
Odor                   .45    -.05     .22     .19        .29
pH                     .22     .68    -.20    -.40        .71
Acidity 1              .10     .47     .10     .37        .38
Acidity 2              .20     .98     .02     .00       1.00
Sake-meter            -.04    -.31    -.68     .55        .86
Reducing sugar         .13    -.39     .76    -.02        .75
Total sugar            .03    -.22     .96     .02        .98
Alcohol               -.07     .31     .52     .60        .72
Formyl-nitrogen        .02     .79    -.05    -.10        .63

Variance accounted for 1.33   2.66    2.34    1.00       7.32


1. Choose $m$ equal to the number of factors necessary for the variance accounted
for to achieve a predetermined percentage, say 80%, of the total variance tr(S)
or tr(R).

2. Choose $m$ equal to the number of eigenvalues greater than the average eigenvalue.
For R the average is 1; for S it is $\sum_{j=1}^p \theta_j/p$.

3. Use the scree test based on a plot of the eigenvalues of S or R. If the graph
drops sharply, followed by a straight line with much smaller slope, choose $m$
equal to the number of eigenvalues before the straight line begins.

4. Test the hypothesis that $m$ is the correct number of factors, $H_0: \Sigma = \Lambda\Lambda' + \Psi$,
where $\Lambda$ is $p \times m$.

Method 1 applies particularly to the principal component method. By (13.32), the
proportion of total sample variance (variance accounted for) due to the $j$th factor
from S is $\sum_{i=1}^p \hat\lambda_{ij}^2/\mathrm{tr}(S)$. The corresponding proportion from R is $\sum_{i=1}^p \hat\lambda_{ij}^2/p$, as
in (13.33). The contribution of all $m$ factors to tr(S) or $p$ is therefore $\sum_{i=1}^p \sum_{j=1}^m \hat\lambda_{ij}^2$,
which is the sum of squares of all elements of $\hat\Lambda$. For the principal component
method, we see by (13.28) and (13.29) that this sum is also equal to the sum of
the first $m$ eigenvalues or to the sum of all $p$ communalities:

$$\sum_{i=1}^p \sum_{j=1}^m \hat\lambda_{ij}^2 = \sum_{i=1}^p \hat h_i^2 = \sum_{j=1}^m \theta_j. \tag{13.46}$$

Thus we choose $m$ sufficiently large so that the sum of the communalities or the sum
of the eigenvalues (variance accounted for) constitutes a relatively large portion of
tr(S) or $p$.



Method 1 can be extended to the principal factor method, where prior estimates
of communalities are used to form $S - \hat\Psi$ or $R - \hat\Psi$. However, $S - \hat\Psi$ or $R - \hat\Psi$ will
often have some negative eigenvalues. Therefore, as values of $m$ range from 1 to $p$,
the cumulative proportion of eigenvalues, $\sum_{j=1}^m \theta_j / \sum_{j=1}^p \theta_j$, will exceed 1.0 and
then reduce to 1.0 as the negative eigenvalues are added. Hence a percentage such as
80% will be reached for a lower value of $m$ than would be the case for S or R, and a
better strategy might be to choose $m$ equal to the value for which the percentage first
exceeds 100%.

In the iterated principal factor method, $m$ is specified before iteration, and $\sum_i \hat h_i^2$
is obtained after iteration as $\sum_i \hat h_i^2 = \mathrm{tr}(S - \hat\Psi)$. To choose $m$ before iterating, one
could use a priori considerations or the eigenvalues of S or R, as in the principal
component method.

Method 2 is a popular criterion of long standing and is the default in many software
packages. Although heuristically based, it often works well in practice. A variation
to method 2 that has been suggested for use with $R - \hat\Psi$ is to let $m$ equal the
number of positive eigenvalues. (There will typically be some negative eigenvalues
of $R - \hat\Psi$.) However, this criterion will often result in too many factors, since the
sum of the positive eigenvalues will exceed the sum of the communalities.

The  scree  test  in  method  3  was  named  after  the  geological  term  scree ,  referring  to 
the  debris  at  the  bottom  of  a  rocky  cliff.  It  also  performs  well  in  practice. 
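
Methods 1 and 2 reduce to simple arithmetic on the eigenvalues. The following plain-Python sketch uses hypothetical function names and the eigenvalues quoted in Examples 13.3.1 and 13.3.3; the 80% target is the illustrative figure mentioned in method 1.

```python
def choose_m_by_proportion(eigenvalues, target=0.80):
    """Method 1: smallest m whose cumulative share of the total reaches target."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for m, theta in enumerate(ordered, start=1):
        running += theta
        if running / total >= target:
            return m
    return len(ordered)

def choose_m_by_average(eigenvalues):
    """Method 2: number of eigenvalues greater than the average eigenvalue."""
    average = sum(eigenvalues) / len(eigenvalues)
    return sum(1 for theta in eigenvalues if theta > average)

# Eigenvalues of R quoted in the text (rounded).
perception = [3.263, 1.538, .168, .031, 0.0]
seishu = [3.17, 2.56, 1.43, 1.28, .54, .47, .25, .12, .10, .06]

print(choose_m_by_proportion(perception), choose_m_by_average(perception))
print(choose_m_by_proportion(seishu), choose_m_by_average(seishu))
```

Both criteria depend only on the eigenvalues of S or R, so they can be applied before any loadings are computed.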

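In method 4 we wish to test

$$H_0: \Sigma = \Lambda\Lambda' + \Psi \quad \text{vs.} \quad H_1: \Sigma \ne \Lambda\Lambda' + \Psi,$$

where $\Lambda$ is $p \times m$. The test statistic, a function of the likelihood ratio, is

$$\left(n - \frac{2p + 4m + 11}{6}\right) \ln\left(\frac{|\hat\Lambda\hat\Lambda' + \hat\Psi|}{|S|}\right), \tag{13.47}$$

which is approximately $\chi^2_\nu$ when $H_0$ is true, where $\nu = \frac{1}{2}[(p - m)^2 - p - m]$ and
$\hat\Lambda$ and $\hat\Psi$ are the maximum likelihood estimators. Rejection of $H_0$ implies that $m$ is
too small and more factors are needed.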
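
Given maximum likelihood estimates, the statistic in (13.47) and its degrees of freedom are straightforward to evaluate. The sketch below only evaluates the formula; it does not compute the maximum likelihood estimates themselves, and the function names, the sample size, and the one-factor inputs are hypothetical. Comparing the statistic with a chi-square critical value is left to the reader.

```python
import numpy as np

def lr_degrees_of_freedom(p, m):
    """Degrees of freedom for (13.47): [(p - m)^2 - p - m] / 2."""
    return ((p - m) ** 2 - p - m) // 2

def lr_statistic(S, Lam, psi, n):
    """Evaluate (13.47) for loadings Lam (p x m) and specific variances psi.

    Lam and psi are assumed to be maximum likelihood estimates;
    S must be nonsingular.
    """
    p, m = Lam.shape
    fitted = Lam @ Lam.T + np.diag(psi)
    _, logdet_fitted = np.linalg.slogdet(fitted)
    _, logdet_S = np.linalg.slogdet(S)
    return (n - (2 * p + 4 * m + 11) / 6.0) * (logdet_fitted - logdet_S)

# Hypothetical one-factor example in which the model reproduces S exactly.
Lam = np.array([[0.9], [0.8], [0.7], [0.6]])
psi = 1.0 - (Lam**2).sum(axis=1)
S = Lam @ Lam.T + np.diag(psi)

print(lr_statistic(S, Lam, psi, n=50))      # essentially zero: perfect fit
print(lr_degrees_of_freedom(4, 1))
print(lr_degrees_of_freedom(10, 4))         # dimensions of the Seishu example
```

Working with log determinants through `slogdet` avoids overflow or underflow that can occur when determinants are formed directly.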

In  practice,  when  n  is  large,  the  test  in  method  4  often  shows  more  factors  to  be 
significant  than  do  the  other  three  methods.  We  may  therefore  consider  the  value  of 
m  indicated  by  the  test  to  be  an  upper  bound  on  the  number  of  factors  with  practical 
importance. 

For  many  data  sets,  the  choice  of  m  will  not  be  obvious.  This  indeterminacy  leaves 
many  statisticians  skeptical  as  to  the  validity  of  factor  analysis.  A  researcher  may 
begin  with  one  of  the  methods  (say,  method  2)  for  an  initial  choice  of  m ,  will  inspect 
the  resulting  percent  of  tr(R)  or  tr(S),  and  will  then  examine  the  rotated  loadings 
for  interpretability.  If  the  percent  of  variance  or  the  interpretation  does  not  seem 
satisfactory,  the  experimenter  will  try  other  values  of  m  in  a  search  for  an  acceptable 
compromise  between  percent  of  tr(R)  and  interpretability  of  the  factors.  Admittedly, 
this  is  a  subjective  procedure,  and  for  such  data  sets  one  could  well  question  the 
outcome (see Section 13.7).

When a data set is successfully fitted by a factor analysis model, the first three
methods  will  almost  always  give  the  same  value  of  m,  and  there  will  be  little  question 
as  to  what  this  value  should  be.  Thus  for  a  “good”  data  set,  the  entire  procedure 
becomes  much  more  objective. 

Example  13.4(a).  We  compare  the  four  methods  of  choosing  m  for  the  perception 
data  used  in  Examples  13.3.1  and  13.3.2. 

Method 1 gives m = 2, because one eigenvalue accounts for 65% of tr(R), and
two eigenvalues account for 96%.

Method 2 gives m = 2, since $\lambda_2 = 1.54$ and $\lambda_3 = .17$.

For method 3, we examine the scree plot in Figure 13.1. It is clear that m = 2 is
indicated.

Method 4 is not available for the perception data because R is singular (the fifth
eigenvalue is zero), and the test involves |R|.

Hence for the perception data, all three available methods agree on m = 2. □

Example  13.4(b).  We  compare  the  four  methods  of  choosing  m  for  the  Seishu  data 
used  in  Examples  13.3.3  and  13.3.4. 

Method 1 gives m = 4 for the principal component method, because four
eigenvalues of R account for 82% of tr(R). For the principal factor method with initial
communality estimates $R_i^2$, the eigenvalues of $\mathbf{R} - \hat{\boldsymbol{\Psi}}$ and corresponding proportions
are as follows:


Eigenvalues              2.86   2.17    .94    .88    .12    .08    .01   -.06   -.13   -.22
Proportions               .43    .33    .14    .13    .02    .01    .00   -.01   -.02   -.03
Cumulative proportions    .43    .76    .90   1.03   1.05   1.06   1.06   1.06   1.03   1.00

Figure 13.1. Scree graph for the perception data.

Figure 13.2. Scree graph for the Seishu data.


The proportions are obtained by dividing the eigenvalues by their sum, 6.63. Thus
the cumulative proportion first exceeds 1.00 for m = 4.
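
The proportion calculation can be reproduced directly from the listed eigenvalues. The sketch below is only an illustration with hypothetical function names; it uses the eigenvalues as printed to two decimals, so their sum (6.65) differs slightly from the 6.63 quoted in the text, which was presumably computed from unrounded values.

```python
# Eigenvalues of R minus the estimated specific variances, as listed for the Seishu data.
eigenvalues = [2.86, 2.17, 0.94, 0.88, 0.12, 0.08, 0.01, -0.06, -0.13, -0.22]

def cumulative_proportions(values):
    """Running sum of the eigenvalues divided by their total."""
    total = sum(values)
    running = 0.0
    result = []
    for value in values:
        running += value
        result.append(running / total)
    return result

def first_m_exceeding(values, target):
    """Smallest m whose cumulative proportion exceeds the target, or None."""
    for m, proportion in enumerate(cumulative_proportions(values), start=1):
        if proportion > target:
            return m
    return None

m_indicated = first_m_exceeding(eigenvalues, 1.0)
for m, proportion in enumerate(cumulative_proportions(eigenvalues), start=1):
    print(m, round(proportion, 2))
```

Because some eigenvalues are negative, the cumulative proportion rises above 1 before falling back to 1 at the last eigenvalue, which is why the criterion looks for the first m at which 1.00 is exceeded.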

Method 2 gives m = 4, since $\lambda_4 = 1.31$ and $\lambda_5 = .61$, where $\lambda_4$ and $\lambda_5$ are
eigenvalues of R.

For  method  3,  we  examine  the  scree  plot  in  Figure  13.2.  There  is  a  discernible 
bend  in  slope  at  the  fifth  eigenvalue. 

For method 4, we use m = 4 in the approximate chi-squared statistic in (13.47)
and obtain $\chi^2 = 9.039$, with degrees of freedom

$$\nu = \tfrac{1}{2}[(p - m)^2 - p - m] = \tfrac{1}{2}[(10 - 4)^2 - 10 - 4] = 11.$$

Since $9.039 < \chi^2_{.05,11} = 19.68$, we do not reject the hypothesis that four factors are
adequate.

Thus for the Seishu data, all four methods agree on m = 4. □


13.5  ROTATION 
13.5.1  Introduction 

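As noted in Section 13.2.2, the factor loadings (rows of $\boldsymbol{\Lambda}$) in the population model
are unique only up to multiplication by an orthogonal matrix that rotates the loadings.
The rotated loadings preserve the essential properties of the original loadings; they
reproduce the covariance matrix and satisfy all basic assumptions. The estimated
loading matrix $\hat{\boldsymbol{\Lambda}}$ can likewise be rotated to obtain $\hat{\boldsymbol{\Lambda}}^* = \hat{\boldsymbol{\Lambda}}\mathbf{T}$, where $\mathbf{T}$ is orthogonal.
Since $\mathbf{T}\mathbf{T}' = \mathbf{I}$ by (2.102), the rotated loadings provide the same estimate of the
covariance matrix as before:

$$\mathbf{S} \cong \hat{\boldsymbol{\Lambda}}^*\hat{\boldsymbol{\Lambda}}^{*\prime} + \hat{\boldsymbol{\Psi}} = \hat{\boldsymbol{\Lambda}}\mathbf{T}\mathbf{T}'\hat{\boldsymbol{\Lambda}}' + \hat{\boldsymbol{\Psi}} = \hat{\boldsymbol{\Lambda}}\hat{\boldsymbol{\Lambda}}' + \hat{\boldsymbol{\Psi}}. \tag{13.48}$$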
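
The invariance in (13.48) is easy to check numerically. The following sketch uses a small hypothetical loading matrix and an arbitrary rotation angle (both are assumptions made for illustration, not values from the text) and confirms that the product of the loading matrix with its transpose is unchanged by an orthogonal rotation.

```python
import math

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def rotation(phi_degrees):
    """2 x 2 orthogonal matrix [[cos, -sin], [sin, cos]]."""
    phi = math.radians(phi_degrees)
    return [[math.cos(phi), -math.sin(phi)],
            [math.sin(phi), math.cos(phi)]]

# Hypothetical 3 x 2 loading matrix and an arbitrary angle.
loadings = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.5]]
T = rotation(25.0)

rotated = matmul(loadings, T)
before = matmul(loadings, transpose(loadings))   # loadings times their transpose
after = matmul(rotated, transpose(rotated))      # same product after rotation

max_diff = max(abs(before[i][j] - after[i][j])
               for i in range(3) for j in range(3))
print(max_diff)
```

The difference is zero up to floating-point rounding, so the fitted covariance matrix cannot distinguish between the original and the rotated loadings.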

Geometrically, the loadings in the ith row of $\hat{\boldsymbol{\Lambda}}$ constitute the coordinates of a
point in the loading space corresponding to $y_i$. Rotation of the p points gives their
coordinates  with  respect  to  new  axes  (factors)  but  otherwise  leaves  their  basic  geo¬ 
metric  configuration  intact.  We  hope  to  find  a  new  frame  of  reference  in  which  the 
factors  are  more  interpretable.  To  this  end,  the  goal  of  rotation  is  to  place  the  axes 
close  to  as  many  points  as  possible.  If  there  are  clusters  of  points  (corresponding  to 
groupings  of  y’s),  we  seek  to  move  the  axes  in  order  to  pass  through  or  near  these 
clusters.  This  would  associate  each  group  of  variables  with  a  factor  (axis)  and  make 
interpretation  more  objective.  The  resulting  axes  then  represent  the  natural  factors. 

If  we  can  achieve  a  rotation  in  which  every  point  is  close  to  an  axis,  then  each 
variable  loads  highly  on  the  factor  corresponding  to  the  axis  and  has  small  loadings 
on  the  remaining  factors.  In  this  case,  there  is  no  ambiguity.  Such  a  happy  state  of 
affairs is called simple structure, and interpretation is greatly simplified. We merely
observe  which  variables  are  associated  with  each  factor,  and  the  factor  is  defined  or 
named  accordingly. 

In  order  to  identify  the  natural  groupings  of  variables,  we  seek  a  rotation  to  an 
interpretable  pattern  for  the  loadings,  in  which  the  variables  load  highly  on  only  one 
factor.  The  number  of  factors  on  which  a  variable  has  moderate  or  high  loadings  is 
called  the  complexity  of  the  variable.  In  the  ideal  situation  referred  to  previously  as 
simple structure, the variables all have a complexity of 1. In this case, the variables
have  been  clearly  clustered  into  groups  corresponding  to  the  factors. 

We  consider  two  basic  types  of  rotation:  orthogonal  and  oblique.  The  rotation  in 
(13.48)  involving  an  orthogonal  matrix  is  an  orthogonal  rotation;  the  original  per¬ 
pendicular  axes  are  rotated  rigidly  and  remain  perpendicular.  In  an  orthogonal  rota¬ 
tion,  angles  and  distances  are  preserved,  communalities  are  unchanged,  and  the  basic 
configuration  of  the  points  remains  the  same.  Only  the  reference  axes  differ.  In  an 
oblique  “rotation”  (transformation),  the  axes  are  not  required  to  remain  perpendicu¬ 
lar  and  are  thus  free  to  pass  closer  to  clusters  of  points. 

In  Sections  13.5.2  and  13.5.3,  we  discuss  orthogonal  and  oblique  rotations,  fol¬ 
lowed  by  some  guidelines  for  interpretation  in  Section  13.5.4. 

13.5.2  Orthogonal  Rotation 

It was noted in Section 13.5.1 that orthogonal rotations preserve communalities.
This is because the rows of $\hat{\boldsymbol{\Lambda}}$ are rotated, and the squared distance of each row from the
origin, which by (13.28) is the communality, is unchanged. However, the variance accounted
for  by  each  factor  as  given  in  (13.31)  will  change,  as  will  the  corresponding  pro¬ 
portion  in  (13.32)  or  (13.33).  The  proportions  due  to  the  rotated  loadings  will  not 
necessarily  be  in  descending  order. 

In  Sections  13.5.2a  and  13.5.2b,  we  consider  two  approaches  to  orthogonal  rota¬ 
tion. 

13.5.2a  Graphical  Approach 

If there are only two factors (m = 2), we can use a graphical rotation based on a
visual inspection of a plot of factor loadings. In this case, the rows of $\hat{\boldsymbol{\Lambda}}$ are pairs of
loadings, $(\hat{\lambda}_{i1}, \hat{\lambda}_{i2})$, $i = 1, 2, \ldots, p$, corresponding to $y_1, y_2, \ldots, y_p$. We choose
an angle $\phi$ through which the axes can be rotated to move them closer to groupings
of points. The new rotated loadings $(\hat{\lambda}^*_{i1}, \hat{\lambda}^*_{i2})$ can be measured directly on the graph
as coordinates with respect to the rotated axes or calculated from $\hat{\boldsymbol{\Lambda}}^* = \hat{\boldsymbol{\Lambda}}\mathbf{T}$ using

$$\mathbf{T} = \begin{pmatrix} \cos\phi & -\sin\phi \\ \sin\phi & \cos\phi \end{pmatrix}. \tag{13.49}$$


Example 13.5.2a. In Example 13.3.1, the initial factor loadings for the perception
data did not provide an interpretation consistent with the two groupings of variables
apparent in the pattern of correlations in R. The five pairs of loadings $(\hat{\lambda}_{i1}, \hat{\lambda}_{i2})$
corresponding to the five variables are plotted in Figure 13.3. An orthogonal rotation
through −35° would bring the axes (factors) closer to the two clusters of points
(variables) identified in Example 13.3.1. With the rotation, each cluster of variables
corresponds much more closely to a factor. Using $\hat{\boldsymbol{\Lambda}}$ from Example 13.3.1 and −35°
in $\mathbf{T}$ as given in (13.49), we obtain the following rotated loadings:

$$\hat{\boldsymbol{\Lambda}}^* = \hat{\boldsymbol{\Lambda}}\mathbf{T} =
\begin{pmatrix} .969 & -.231 \\ .519 & .807 \\ .785 & -.587 \\ .971 & -.210 \\ .704 & .667 \end{pmatrix}
\begin{pmatrix} .819 & .574 \\ -.574 & .819 \end{pmatrix}
=
\begin{pmatrix} .927 & .367 \\ -.037 & .959 \\ .980 & -.031 \\ .916 & .385 \\ .194 & .950 \end{pmatrix}.$$

Figure 13.3. Plot of the two loadings for each of the five variables in the perception data of
Table 13.1.


In Table 13.6, we compare the rotated loadings in $\hat{\boldsymbol{\Lambda}}^*$ with the original loadings in
$\hat{\boldsymbol{\Lambda}}$.

The  interpretation  of  the  rotated  loadings  is  clear.  As  indicated  by  the  boldface 
loadings  in  Table  13.6,  the  first  factor  is  associated  with  variables  1,  3,  and  4:  kind, 
happy,  and  likeable.  The  second  factor  is  associated  with  the  other  two  variables: 
intelligent  and  just.  This  same  grouping  of  variables  is  indicated  by  the  pattern  in 
the  correlation  matrix  in  (13.35)  and  can  also  be  seen  in  the  two  clusters  of  points  in 
Figure  13.3.  The  first  factor  might  be  described  as  representing  a  person’s  perceived 
humanity  or  amiability,  while  the  second  involves  more  logical  or  rational  practices. 

Note that if the angle between the rotated axes is allowed to be less than 90°
(an oblique rotation), the lower axis representing $f_1^*$ could come closer to the points
corresponding to variables 1 and 4 so that the coordinates on $f_2^*$, .367 and .385, could
be reduced. However, the basic interpretation would not change; variables 1 and 4
would still be associated with $f_1^*$. □
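
The graphical rotation of this example can be reproduced numerically. The sketch below applies the rotation matrix of (13.49) with an angle of −35° to the unrotated loadings listed for the perception data in Table 13.6. The names `LOADINGS` and `rotate` are chosen here for illustration, and because the inputs are rounded to three decimals the results can differ from the tabled values in the third decimal place.

```python
import math

# Unrotated principal component loadings for the perception data (Table 13.6).
LOADINGS = {
    "Kind":        (0.969, -0.231),
    "Intelligent": (0.519,  0.807),
    "Happy":       (0.785, -0.587),
    "Likeable":    (0.971, -0.210),
    "Just":        (0.704,  0.667),
}

def rotate(pair, phi_degrees):
    """Post-multiply the row (l1, l2) by T = [[cos, -sin], [sin, cos]]."""
    phi = math.radians(phi_degrees)
    l1, l2 = pair
    return (l1 * math.cos(phi) + l2 * math.sin(phi),
            -l1 * math.sin(phi) + l2 * math.cos(phi))

rotated = {name: rotate(pair, -35.0) for name, pair in LOADINGS.items()}

# Row sums of squares (communalities) are unchanged by the rotation.
communalities_before = {n: a * a + b * b for n, (a, b) in LOADINGS.items()}
communalities_after = {n: a * a + b * b for n, (a, b) in rotated.items()}

# Column sums of squares (variance accounted for) are redistributed.
variance_before = [sum(p[j] ** 2 for p in LOADINGS.values()) for j in range(2)]
variance_after = [sum(p[j] ** 2 for p in rotated.values()) for j in range(2)]

for name, (a, b) in rotated.items():
    print(f"{name:12s} {a:6.3f} {b:6.3f}")
```

The total variance accounted for is the same before and after rotation; only its split between the two factors changes.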


Table 13.6. Graphically Rotated Loadings for the Perception Data of Table 13.1

                   Principal Component    Graphically Rotated
                        Loadings               Loadings          Communalities,
Variables             f1        f2           f1        f2         $\hat{h}_i^2$

Kind                 .969     -.231         .927      .367            .993
Intelligent          .519      .807        -.037      .959            .921
Happy                .785     -.587         .980     -.031            .960
Likeable             .971     -.210         .916      .385            .987
Just                 .704      .667         .194      .950            .940

Variance
accounted for       3.263     1.538        2.696     2.106           4.802
Proportion of
total variance       .653      .308         .539      .421            .960
Cumulative
proportion           .653      .960         .539      .960            .960

13.5.2b  Varimax  Rotation

The  graphical  approach  to  rotation  is  generally  limited  to  m  =  2.  For  m  >  2,  various 
analytical  methods  have  been  proposed.  The  most  popular  of  these  is  the  varimax 
technique,  which  seeks  rotated  loadings  that  maximize  the  variance  of  the  squared 
loadings in each column of $\hat{\boldsymbol{\Lambda}}^*$. If the loadings in a column were nearly equal, the
variance would be close to 0. As the squared loadings approach 0 and 1 (for factoring
R), the variance will approach a maximum. Thus the varimax method attempts to
make  the  loadings  either  large  or  small  to  facilitate  interpretation. 

The  varimax  procedure  cannot  guarantee  that  all  variables  will  load  highly  on 
only  one  factor.  In  fact,  no  procedure  could  do  this  for  all  possible  data  sets.  The 
configuration  of  the  points  in  the  loading  space  remains  fixed;  we  merely  rotate  the 
axes  to  be  as  close  to  as  many  points  as  possible.  In  many  cases,  the  points  are 
not  well  clustered,  and  the  axes  simply  cannot  be  rotated  so  as  to  be  near  all  of  them. 
This problem is compounded by having to choose m. If m is changed, the coordinates
$(\hat{\lambda}_{i1}, \hat{\lambda}_{i2}, \ldots, \hat{\lambda}_{im})$ change, and the relative position of the points is altered.

The varimax rotation is available in virtually all factor analysis software programs.
The output typically includes the rotated loading matrix $\hat{\boldsymbol{\Lambda}}^*$, the variance
accounted for (sum of squares of each column of $\hat{\boldsymbol{\Lambda}}^*$), the communalities (sum of
squares of each row of $\hat{\boldsymbol{\Lambda}}^*$), and the orthogonal matrix $\mathbf{T}$ used to obtain $\hat{\boldsymbol{\Lambda}}^* = \hat{\boldsymbol{\Lambda}}\mathbf{T}$.

Example  13.5.2b(a).  In  Example  13.5.2a,  a  graphical  rotation  was  devised  visually 
to  achieve  interpretable  loadings  for  the  perception  data  of  Table  13.1.  As  we  would 
expect,  the  varimax  method  yields  a  similar  result.  The  varimax  rotated  loadings 
are  given  in  Table  13.7.  For  comparison,  we  have  included  the  original  unrotated 
loadings  from  Table  13.3  and  the  graphically  rotated  loadings  from  Table  13.6. 


Table 13.7. Varimax Rotated Factor Loadings for the Perception Data of Table 13.1

               Principal Component   Graphically Rotated    Varimax Rotated
                    Loadings              Loadings             Loadings        Communalities
Variables         f1        f2          f1        f2         f1        f2      $\hat{h}_i^2$

Kind             .969     -.231        .927      .367       .951      .298         .993
Intelligent      .519      .807       -.037      .959       .033      .959         .921
Happy            .785     -.587        .980     -.031       .975     -.103         .960
Likeable         .971     -.210        .916      .385       .941      .317         .987
Just             .704      .667        .194      .950       .263      .933         .940

Variance
accounted for   3.263     1.538       2.696     2.106      2.811     1.991        4.802
Proportion of
total variance   .653      .308        .539      .421       .562      .398         .960
Cumulative
proportion       .653      .960        .539      .960       .562      .960         .960

The orthogonal matrix $\mathbf{T}$ for the varimax rotation is

$$\mathbf{T} = \begin{pmatrix} .859 & .512 \\ -.512 & .859 \end{pmatrix}.$$

By (13.49), $-\sin\phi = .512$, and the angle of rotation is given by $\phi = -\sin^{-1}(.512) =
-30.8°$. Thus the varimax rotation chose an angle of rotation of −30.8° as compared
to the −35° we selected visually, but the results are very close and the interpretation
is exactly the same. □
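
For m = 2 the varimax idea can be illustrated without special software by searching over the rotation angle. The sketch below is a brute-force grid search on the raw varimax criterion (the sum over columns of the variance of the squared loadings); it is not the iterative algorithm used by statistical packages, it omits the row normalization that some packages apply by default, and it works from loadings rounded to three decimals. The angle it finds should therefore be expected to agree with the −30.8° reported above only approximately.

```python
import math

# Unrotated loadings for the perception data (Table 13.7, first two columns).
LOADINGS = [(0.969, -0.231), (0.519, 0.807), (0.785, -0.587),
            (0.971, -0.210), (0.704, 0.667)]

def rotate(rows, phi_degrees):
    """Rotate every row by T = [[cos, -sin], [sin, cos]]."""
    phi = math.radians(phi_degrees)
    c, s = math.cos(phi), math.sin(phi)
    return [(a * c + b * s, -a * s + b * c) for a, b in rows]

def raw_varimax(rows):
    """Sum over columns of the variance of the squared loadings."""
    p = len(rows)
    total = 0.0
    for j in range(len(rows[0])):
        squares = [row[j] ** 2 for row in rows]
        mean = sum(squares) / p
        total += sum((x - mean) ** 2 for x in squares) / p
    return total

# Angles -45.0, -44.9, ..., 45.0 degrees. For two factors a 90-degree range
# is enough, since a further quarter turn only swaps the two axes.
angles = [k / 10 for k in range(-450, 451)]
best_angle = max(angles, key=lambda a: raw_varimax(rotate(LOADINGS, a)))

print(best_angle, raw_varimax(rotate(LOADINGS, best_angle)))
```

The search lands close to the angle chosen by eye in Example 13.5.2a, which is the point of the example: when the clusters are well defined, visual and analytical rotations agree.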


Example  13.5.2b(b).  In  Examples  13.3.3  and  13.3.4,  we  obtained  the  iterated  prin¬ 
cipal  factor  loadings  and  maximum  likelihood  loadings  for  the  Seishu  data.  In 
Table  13.8,  we  show  the  varimax  rotation  of  these  two  sets  of  loadings.  The  similar¬ 
ities  in  the  two  sets  of  rotated  loadings  are  striking.  The  interpretation  in  each  case 
is  the  same.  The  variances  accounted  for  are  virtually  identical. 

The  rotation  in  each  case  has  achieved  a  satisfactory  simple  structure  and  most 
variables  show  a  complexity  of  1.  The  boldface  loadings  indicate  the  variables  asso¬ 
ciated  with  each  factor  for  interpretation  purposes.  These  may  be  meaningful  to  the 
researcher.  For  example,  factor  2  is  associated  with  sake-meter,  reducing  sugar,  and 
total  sugar,  whereas  factor  3  is  aligned  with  taste  and  odor.  □ 


13.5.3  Oblique  Rotation 

The  term  oblique  rotation  refers  to  a  transformation  in  which  the  axes  do  not  remain 
perpendicular.  Technically,  the  term  oblique  rotation  is  a  misnomer,  since  rotation 
implies an orthogonal transformation that preserves distances. A more accurate
characterization would be oblique transformation, but the term oblique rotation is well
established in the literature.


Table 13.8. Varimax Rotated Loadings for the Seishu Data

                   Iterated Principal Factor        Maximum Likelihood
                       Rotated Loadings              Rotated Loadings
Variables            f1     f2     f3     f4       f1     f2     f3     f4

Taste               .16   -.01    .99   -.09      .16   -.00    .98   -.10
Odor               -.11    .14    .48    .14     -.07    .14    .49    .17
pH                  .88   -.12    .02   -.13      .82   -.10    .08   -.15
Acidity 1           .26   -.09    .09    .54      .29   -.08    .11    .53
Acidity 2           .89   -.06    .10    .43      .91   -.06    .10    .39
Sake-meter         -.43   -.76    .01    .07     -.46   -.80    .04    .10
Reducing sugar     -.37    .76    .18    .03     -.37    .75    .20    .08
Total sugar        -.26    .92    .10    .25     -.27    .91    .11    .26
Alcohol            -.01    .25    .00    .80     -.00    .25    .01    .81
Formyl-nitrogen     .74   -.07   -.08    .20      .76   -.07   -.08    .22

Variance
accounted for      2.62   2.12   1.27   1.27     2.61   2.14   1.29   1.28

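Instead of the orthogonal transformation matrix $\mathbf{T}$ used in (13.16), (13.17), and
(13.18), an oblique rotation uses a general nonsingular transformation matrix $\mathbf{Q}$ to
obtain $\mathbf{f}^* = \mathbf{Q}'\mathbf{f}$, and by (3.74),

$$\operatorname{cov}(\mathbf{f}^*) = \mathbf{Q}'\mathbf{I}\mathbf{Q} = \mathbf{Q}'\mathbf{Q} \neq \mathbf{I}. \tag{13.50}$$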

Thus  the  new  factors  are  correlated.  Since  distances  and  angles  are  not  preserved,  the 
communalities  for  f*  are  different  from  those  for  f.  Some  program  packages  report 
communalities  obtained  from  the  original  loadings,  rather  than  the  oblique  loadings. 

When  the  axes  are  not  required  to  be  perpendicular,  they  can  more  easily  pass 
through  the  major  clusters  of  points  in  the  loading  space  (assuming  there  are  such 
clusters).  For  example,  in  Figure  13.4,  we  have  plotted  the  varimax  rotated  loadings 
for  two  factors  extracted  from  the  sons  data  of  Table  3.7  (see  Example  13.5.3  at  the 
end  of  this  section).  Oblique  axes  with  an  angle  of  38°  would  pass  much  closer  to  the 
points,  and  the  resulting  loadings  would  be  very  close  to  0  and  1.  However,  the  inter¬ 
pretation  would  not  change,  since  the  same  points  (variables)  would  be  associated 
with  the  oblique  axes  as  with  the  orthogonal  axes. 

Various  analytical  methods  for  achieving  oblique  rotations  have  been  proposed 
and  are  available  in  program  packages.  Typically,  the  output  of  one  of  these  pro¬ 
cedures  includes  a  pattern  matrix,  a  structure  matrix,  and  a  matrix  of  correlations 
among the oblique factors. For interpretation, we would usually prefer the pattern
matrix rather than the structure matrix. The loadings in a row of the pattern matrix
are the natural coordinates of the point (variable) on the oblique axes and serve as
coefficients in the model relating the variable to the factors.

Figure 13.4. Orthogonal and oblique rotations for the sons data.

One  use  for  an  oblique  rotation  is  to  check  on  the  orthogonality  of  the  factors. 
The  orthogonality  in  the  original  factors  is  imposed  by  the  model  and  maintained  by 
an  orthogonal  rotation.  If  an  oblique  rotation  produces  a  correlation  matrix  that  is 
nearly  diagonal,  we  can  be  more  confident  that  the  factors  are  indeed  orthogonal. 

Example 13.5.3. The correlation matrix for the sons data of Table 3.7 is

$$\mathbf{R} = \begin{pmatrix}
1.000 & .735 & .711 & .704 \\
.735 & 1.000 & .693 & .709 \\
.711 & .693 & 1.000 & .839 \\
.704 & .709 & .839 & 1.000
\end{pmatrix}.$$

The  varimax  rotated  loadings  for  two  factors  obtained  by  the  principal  component 
method  are  given  in  Table  13.9  and  plotted  in  Figure  13.4.  An  analytical  oblique 
rotation  (Harris-Kaiser  orthoblique  method  in  SAS)  produced  oblique  axes  with  an 
angle  of  38°,  the  same  as  obtained  by  a  graphical  approach.  The  correlation  between 
the  two  factors  is  .79  [obtained  from  Q'Q  in  (13.50)],  which  is  related  to  the  angle 
by  (3.15),  .79  =  cos  38°.  The  pattern  loadings  are  given  in  Table  13.9. 
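
The link between the angle and the factor correlation can be verified with a short calculation. The sketch below assumes that the columns of the transformation matrix are unit-length vectors pointing along the two oblique axes, so that the off-diagonal element of Q'Q in (13.50) is the cosine of the angle between them. The absolute placement of the axes (26° and 64°) is hypothetical; only their 38° separation comes from the example.

```python
import math

def unit_vector(angle_degrees):
    """Unit-length vector at the given angle from the first coordinate axis."""
    a = math.radians(angle_degrees)
    return (math.cos(a), math.sin(a))

def gram(columns):
    """Q'Q for a matrix Q supplied as a list of its columns."""
    return [[sum(x * y for x, y in zip(ci, cj)) for cj in columns]
            for ci in columns]

# Two oblique axes separated by 38 degrees (placement is an assumption).
q_columns = [unit_vector(26.0), unit_vector(64.0)]
factor_corr = gram(q_columns)

print(round(factor_corr[0][1], 2))
```

With perpendicular axes the same calculation gives an off-diagonal element of 0, which is the orthogonal case in which Q'Q reduces to the identity.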

The  oblique  loadings  give  a  much  cleaner  simple  structure  than  the  varimax  load¬ 
ings,  but  the  interpretation  is  essentially  the  same  if  we  neglect  loadings  below  .45 
on  the  varimax  rotation. 

In  Figure  13.4,  it  is  evident  that  a  single  factor  would  be  adequate  since  the  angle 
between  axes  is  less  than  45°.  The  suggestion  to  let  m  =  1  is  also  supported  by  the 
first  three  criteria  in  Section  13.4:  the  eigenvalues  of  R  are  3.20,  .38,  .27,  and  .16.  The 
first  accounts  for  80%;  the  second  for  an  additional  9%.  The  large  correlation,  .79, 
between  the  two  oblique  factors  constitutes  additional  evidence  that  a  single-factor 
model  would  suffice  here.  In  fact,  the  pattern  in  R  itself  indicates  the  presence  of  only 
one  factor.  The  four  variables  form  only  one  cluster,  since  all  are  highly  correlated. 
There  are  no  small  correlations  between  groupings  of  variables.  □ 
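
The dominance of the first eigenvalue can be checked from the correlation matrix itself. The sketch below uses plain power iteration, which is chosen here only because it needs no libraries and is not necessarily how the eigenvalues in the text were obtained; the function name is hypothetical.

```python
# Correlation matrix for the sons data, as given in Example 13.5.3.
R = [[1.000, 0.735, 0.711, 0.704],
     [0.735, 1.000, 0.693, 0.709],
     [0.711, 0.693, 1.000, 0.839],
     [0.704, 0.709, 0.839, 1.000]]

def largest_eigenvalue(matrix, iterations=200):
    """Power iteration; adequate here because the first eigenvalue dominates."""
    n = len(matrix)
    v = [1.0] * n
    value = 0.0
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
        # Rayleigh quotient v'Rv for the unit vector v.
        value = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(n))
                    for i in range(n))
    return value

lambda_1 = largest_eigenvalue(R)
proportion = lambda_1 / len(R)   # tr(R) = p for a correlation matrix

print(round(lambda_1, 2), round(proportion, 2))
```

Since the average eigenvalue of a correlation matrix is 1, a first eigenvalue of about 3.2 with all others well below 1 is what methods 1 and 2 of Section 13.4 would read as a single factor.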


Table 13.9. Varimax and Orthoblique Loadings for the
Sons Data

              Varimax Loadings     Orthoblique Pattern Matrix
Variable        f1       f2             f1       f2

1              .42      .82            .03      .90
2              .40      .85           -.03      .96
3              .87      .41            .97     -.01
4              .86      .43            .95      .01

13.5.4  Interpretation

In  Sections  13.5.1,  13.5.2,  and  13.5.3,  we  have  discussed  the  usefulness  of  rotation 
as  an  aid  to  interpretation.  Our  goal  is  to  achieve  a  simple  structure  in  which  each 
variable  loads  highly  on  only  one  factor,  with  small  loadings  on  all  other  factors.  In 
practice,  we  often  fail  to  achieve  this  goal,  but  rotation  usually  produces  loadings 
that  are  closer  to  the  desired  simple  structure. 

We  now  suggest  general  guidelines  for  interpreting  the  factors  by  examination  of 
the  matrix  of  rotated  factor  loadings.  Moving  horizontally  from  left  to  right  across  the 
m  loadings  in  each  row,  identify  the  highest  loading  (in  absolute  value).  If  the  highest 
loading  is  of  a  significant  size  (a  subjective  determination,  see  the  next  paragraph), 
circle  or  underline  it.  This  is  done  for  each  of  the  p  variables.  There  may  be  other 
significant  loadings  in  a  row  besides  the  one  circled.  If  these  are  considered,  the 
interpretation  is  less  simple.  On  the  other  hand,  there  may  be  variables  with  such 
small  communalities  that  no  significant  loading  appears  on  any  factor.  In  this  case, 
the  researcher  may  wish  to  increase  the  number  of  factors  and  run  the  program  again 
so  that  these  variables  might  associate  with  a  new  factor. 

To  assess  significance  of  factor  loadings  obtained  from  R,  a  threshold  value  of 
.3  has  been  advocated  by  many  writers.  For  most  successful  applications,  however, 
a  critical  value  of  .3  is  too  low  and  will  result  in  variables  of  complexity  greater 
than  1.  A  target  value  of  .5  or  .6  is  typically  more  useful.  The  .3  criterion  is  loosely 
based  on  the  critical  value  for  significance  of  an  ordinary  correlation  coefficient,  r. 
However,  the  distribution  of  the  sample  loadings  is  not  the  same  as  that  of  r  arising 
from  the  bivariate  normal.  In  addition,  the  critical  value  should  be  increased  because 
mp values of $\hat{\lambda}_{ij}$ are being tested. On the other hand, if m is large, the critical value
might possibly need to be reduced somewhat. Since $\hat{h}_i^2 = \sum_{j=1}^{m} \hat{\lambda}_{ij}^2$ is bounded by
1, an increase in m reduces the average squared loading in a row.
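
The effect of the threshold on complexity can be seen with the varimax loadings for the sons data in Table 13.9. The sketch below is illustrative only; the function names are hypothetical, and the two cutoffs (.3 and .45) are the values mentioned in the text rather than recommended defaults.

```python
# Varimax loadings for the four variables of the sons data (Table 13.9).
VARIMAX = {1: (0.42, 0.82), 2: (0.40, 0.85), 3: (0.87, 0.41), 4: (0.86, 0.43)}

def complexity(row, threshold):
    """Number of factors on which a variable's |loading| reaches the threshold."""
    return sum(1 for loading in row if abs(loading) >= threshold)

def assigned_factor(row, threshold):
    """1-based index of the largest |loading| if it reaches the threshold, else None."""
    j = max(range(len(row)), key=lambda k: abs(row[k]))
    return j + 1 if abs(row[j]) >= threshold else None

for cutoff in (0.30, 0.45):
    summary = {var: complexity(row, cutoff) for var, row in VARIMAX.items()}
    print(cutoff, summary)
```

With a cutoff of .3 every variable has complexity 2, whereas raising the cutoff to .45 gives each variable a complexity of 1, which illustrates why a threshold of .3 is often too low.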

After  identifying  potentially  significant  loadings,  the  experimenter  then  attempts 
to  discover  some  meaning  in  the  factors  and,  ideally,  to  label  or  name  them.  This  can 
readily  be  done  if  the  group  of  variables  associated  with  each  factor  makes  sense  to 
the  researcher.  But  in  many  situations,  the  groupings  are  not  so  logical,  and  a  revision 
can be tried, such as adjusting the size of loading deemed to be important, changing
m, using a different method of estimating the loadings, or employing another type of
rotation.


13.6  FACTOR  SCORES 

In many applications, the researcher wishes only to ascertain whether a factor
analysis model fits the data and to identify the factors. In other applications, the
experimenter wishes to obtain factor scores, $\hat{\mathbf{f}}_i = (\hat{f}_{i1}, \hat{f}_{i2}, \ldots, \hat{f}_{im})'$, $i = 1, 2, \ldots, n$,
which are defined as estimates of the underlying factor values for each observation.
There are two potential uses for such scores: (1) the behavior of the observations in
terms of the factors may be of interest and (2) we may wish to use the factor scores
as input to another analysis, such as MANOVA. The latter usage resembles a similar
application of principal components.

Since  the  f’s  are  not  observed,  we  must  estimate  them  as  functions  of  the 
observed  y’s.  The  most  popular  approach  to  estimating  the  factors  is  based  on 
regression  (Thomson  1951).  We  will  discuss  this  method  and  also  briefly  describe  an 
informal  technique  that  can  be  used  when  R  (or  S)  is  singular.  For  other  approaches 
see  Harman  (1976,  Chapter  16). 

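Since $E(f_j) = 0$, we relate the f's to the y's by a centered regression model

$$\begin{aligned}
f_1 &= \beta_{11}(y_1 - \bar{y}_1) + \beta_{12}(y_2 - \bar{y}_2) + \cdots + \beta_{1p}(y_p - \bar{y}_p) + \epsilon_1, \\
f_2 &= \beta_{21}(y_1 - \bar{y}_1) + \beta_{22}(y_2 - \bar{y}_2) + \cdots + \beta_{2p}(y_p - \bar{y}_p) + \epsilon_2, \\
&\;\;\vdots \\
f_m &= \beta_{m1}(y_1 - \bar{y}_1) + \beta_{m2}(y_2 - \bar{y}_2) + \cdots + \beta_{mp}(y_p - \bar{y}_p) + \epsilon_m,
\end{aligned} \tag{13.51}$$

which can be written in matrix form as

$$\mathbf{f} = \mathbf{B}_1'(\mathbf{y} - \bar{\mathbf{y}}) + \boldsymbol{\epsilon}. \tag{13.52}$$

We have used the notation $\boldsymbol{\epsilon}$ to distinguish this error from $\boldsymbol{\varepsilon}$ in the original factor
model $\mathbf{y} - \boldsymbol{\mu} = \boldsymbol{\Lambda}\mathbf{f} + \boldsymbol{\varepsilon}$ given in (13.3). Our approach is to estimate $\mathbf{B}_1$ and use the
predicted value $\hat{\mathbf{f}} = \hat{\mathbf{B}}_1'(\mathbf{y} - \bar{\mathbf{y}})$ to estimate $\mathbf{f}$.

The model (13.52) holds for each observation:

$$\mathbf{f}_i = \mathbf{B}_1'(\mathbf{y}_i - \bar{\mathbf{y}}) + \boldsymbol{\epsilon}_i, \quad i = 1, 2, \ldots, n.$$

In transposed form, the model becomes

$$\mathbf{f}_i' = (\mathbf{y}_i - \bar{\mathbf{y}})'\mathbf{B}_1 + \boldsymbol{\epsilon}_i', \quad i = 1, 2, \ldots, n,$$

and these n equations can be combined into a single model,

$$\mathbf{F} = \begin{pmatrix} \mathbf{f}_1' \\ \mathbf{f}_2' \\ \vdots \\ \mathbf{f}_n' \end{pmatrix}
= \begin{pmatrix} (\mathbf{y}_1 - \bar{\mathbf{y}})'\mathbf{B}_1 \\ (\mathbf{y}_2 - \bar{\mathbf{y}})'\mathbf{B}_1 \\ \vdots \\ (\mathbf{y}_n - \bar{\mathbf{y}})'\mathbf{B}_1 \end{pmatrix}
+ \begin{pmatrix} \boldsymbol{\epsilon}_1' \\ \boldsymbol{\epsilon}_2' \\ \vdots \\ \boldsymbol{\epsilon}_n' \end{pmatrix}
= \begin{pmatrix} (\mathbf{y}_1 - \bar{\mathbf{y}})' \\ (\mathbf{y}_2 - \bar{\mathbf{y}})' \\ \vdots \\ (\mathbf{y}_n - \bar{\mathbf{y}})' \end{pmatrix} \mathbf{B}_1 + \boldsymbol{\Xi}
= \mathbf{Y}_c\mathbf{B}_1 + \boldsymbol{\Xi} \quad [\text{by (10.11)}]. \tag{13.53}$$
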
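The model (13.53) has the appearance of a centered multivariate multiple regression
model as in Section 10.4.5, with $\mathbf{Y}_c$ in place of $\mathbf{X}_c$. By (10.50), the estimate for
$\mathbf{B}_1$ would be

$$\hat{\mathbf{B}}_1 = (\mathbf{Y}_c'\mathbf{Y}_c)^{-1}\mathbf{Y}_c'\mathbf{F}. \tag{13.54}$$

However, $\mathbf{F}$ is unobserved. To evaluate $\hat{\mathbf{B}}_1$ in spite of this, we first use (10.52) to
rewrite (13.54) in terms of covariance matrices,

$$\hat{\mathbf{B}}_1 = \mathbf{S}_{yy}^{-1}\mathbf{S}_{yf}. \tag{13.55}$$

In the notation of the present chapter, $\mathbf{S}_{yy}$ is represented by $\mathbf{S}$; for $\mathbf{S}_{yf}$ we use $\hat{\boldsymbol{\Lambda}}$, since
$\hat{\boldsymbol{\Lambda}}$ estimates $\operatorname{cov}(\mathbf{y}, \mathbf{f}) = \boldsymbol{\Lambda}$ in (13.13). Thus, based on the assumptions in Section
13.2.1, we can write (13.55) as

$$\hat{\mathbf{B}}_1 = \mathbf{S}^{-1}\hat{\boldsymbol{\Lambda}}. \tag{13.56}$$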


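Then from model (13.53), the estimated (predicted) $\mathbf{f}_i$ values are given by

$$\hat{\mathbf{F}} = \begin{pmatrix} \hat{\mathbf{f}}_1' \\ \hat{\mathbf{f}}_2' \\ \vdots \\ \hat{\mathbf{f}}_n' \end{pmatrix}
= \mathbf{Y}_c\hat{\mathbf{B}}_1 = \mathbf{Y}_c\mathbf{S}^{-1}\hat{\boldsymbol{\Lambda}}. \tag{13.57}$$

If R is factored instead of S, (13.56) and (13.57) become

$$\hat{\mathbf{B}}_1 = \mathbf{R}^{-1}\hat{\boldsymbol{\Lambda}}, \tag{13.58}$$

$$\hat{\mathbf{F}} = \mathbf{Y}_s\mathbf{R}^{-1}\hat{\boldsymbol{\Lambda}}, \tag{13.59}$$

respectively, where $\mathbf{Y}_s$ is the observed matrix of standardized variables, $(y_{ij} - \bar{y}_j)/s_j$.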

We  would  ordinarily  obtain  factor  scores  for  the  rotated  factors  rather  than  the 
original factors. Thus $\hat{\boldsymbol{\Lambda}}$ in (13.57) or (13.59) would be replaced by $\hat{\boldsymbol{\Lambda}}^*$.
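
The regression method for factor scores can be sketched for the case in which R is factored, as in (13.58) and (13.59). The example below uses the sons-data correlation matrix from Example 13.5.3 and the varimax loadings of Table 13.9 purely as convenient inputs; the standardized observation `z` is hypothetical, and the small Gauss-Jordan solver is an illustrative stand-in for whatever linear algebra routine a package would use.

```python
R = [[1.000, 0.735, 0.711, 0.704],
     [0.735, 1.000, 0.693, 0.709],
     [0.711, 0.693, 1.000, 0.839],
     [0.704, 0.709, 0.839, 1.000]]

# Varimax rotated loadings (Table 13.9), one row per variable.
LOADINGS = [[0.42, 0.82], [0.40, 0.85], [0.87, 0.41], [0.86, 0.43]]

def solve(a, b):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular; the regression method does not apply")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

# Coefficient matrix: the inverse of R times the rotated loadings.
B = solve(R, LOADINGS)

def factor_scores(z, coefficients):
    """Scores for one standardized observation z: the row vector z'B."""
    return [sum(z[i] * coefficients[i][j] for i in range(len(z)))
            for j in range(len(coefficients[0]))]

z = [0.5, -0.2, 1.1, 0.9]   # hypothetical standardized observation
scores = factor_scores(z, B)
print([round(s, 3) for s in scores])
```

Solving the linear system rather than forming the inverse explicitly is a common numerical choice; it also makes the requirement that R be nonsingular visible as an explicit error.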

In  order  to  obtain  factor  scores  by  (13.57)  or  (13.59),  S  or  R  must  be  nonsin¬ 
gular.  When  R  (or  S)  is  singular,  we  can  obtain  factor  scores  by  a  simple  method 
based  directly  on  the  rotated  loadings.  We  cluster  the  variables  into  groups  (factors) 
according  to  the  loadings  and  find  a  score  for  each  factor  by  averaging  the  vari¬ 
ables  associated  with  the  factor.  If  the  variables  are  not  commensurate,  the  variables 
should  be  standardized  before  averaging.  An  alternative  approach  would  be  to  weight 
the  variables  by  their  loadings  when  averaging. 
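
The informal averaging method can be sketched as follows. The grouping of variables mirrors the two factors found for the perception data, but the ratings themselves are hypothetical values invented for this illustration, as are the function names.

```python
def standardize(column):
    """Center a variable and divide by its sample standard deviation."""
    n = len(column)
    mean = sum(column) / n
    sd = (sum((x - mean) ** 2 for x in column) / (n - 1)) ** 0.5
    return [(x - mean) / sd for x in column]

def group_scores(data, groups):
    """Average the standardized variables assigned to each factor.

    data:   dict mapping variable name -> list of observations
    groups: dict mapping factor name -> list of variable names
    """
    z = {name: standardize(column) for name, column in data.items()}
    n = len(next(iter(data.values())))
    return {factor: [sum(z[v][i] for v in names) / len(names) for i in range(n)]
            for factor, names in groups.items()}

# Hypothetical ratings for four subjects.
data = {
    "kind": [1, 5, 9, 3], "happy": [2, 6, 8, 4], "likeable": [1, 6, 9, 2],
    "intelligent": [7, 3, 4, 8], "just": [8, 2, 5, 7],
}
groups = {"factor 1": ["kind", "happy", "likeable"],
          "factor 2": ["intelligent", "just"]}

scores = group_scores(data, groups)
for factor, values in scores.items():
    print(factor, [round(v, 2) for v in values])
```

Replacing the simple average with a loading-weighted average, as suggested in the text, would only require multiplying each standardized variable by its loading before summing.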

Example  13.6.  The  speaking  rate  of  four  voices  was  artificially  manipulated  by
means  of  a  rate  changer  without  altering  the  pitch  (Brown,  Strong,  and  Rencher 
1973).  There  were  five  rates  for  each  voice: 

FF  =  45%  faster, 

F  =  25%  faster, 

N  =  normal  rate, 

S  =  22%  slower, 

SS  =  42%  slower. 

The  resulting  20  voices  were  played  to  30  judges,  who  rated  them  on  15  paired- 
opposite  adjectives  (variables)  with  a  14-point  scale  between  poles.  The  following 
adjectives  were  used:  intelligent,  ambitious,  polite,  active,  confident,  happy,  just, 
likeable,  kind,  sincere,  dependable,  religious,  good-looking,  sociable,  and  strong. 
The  results  were  averaged  over  the  30  judges  to  produce  20  observation  vectors  of 
15  variables  each.  The  averaging  produced  very  reliable  data,  so  that  even  though 
there  were  only  20  observations  on  15  variables,  the  factor  analysis  model  fit  very 
well.  The  correlation  matrix  is  as  follows: 


[The 15 × 15 correlation matrix R for the voice data is not legible in this copy.]

The eigenvalues of R are 7.91, 5.85, .31, .26, ..., .002, with the scree plot in
Figure 13.5. Clearly, by any criterion for choosing m, there are two factors.

All  four  major  methods  of  factor  extraction  discussed  in  Section  13.3  produced 
nearly  identical  results  (after  rotation).  We  give  the  initial  and  rotated  loadings 
obtained  from  the  principal  component  method  in  Table  13.10. 

The  two  rotated  factors  were  labeled  competence  and  benevolence.  The  same  two 
factors  emerged  consistently  in  similar  studies  with  different  voices  and  different 
judges. 

The two groupings of variables can also be seen in the correlation matrix. For
example, in the first row, the large correlations correspond to the boldface rotated
loadings for $f_1$, whereas in the third row, the large correlations correspond to the
boldface rotated loadings for $f_2$.

Figure 13.5. Scree graph for voice data.


Table 13.10. Initial and Varimax Rotated Loadings for the Voice Data

                  Initial Loadings     Rotated Loadings
Variable            f1       f2          f1       f2       Communalities

Intelligent        .71     -.65         .96     -.06           .93
Ambitious          .48     -.84         .90     -.36           .94
Polite             .50      .81        -.12      .95           .92
Active             .37     -.91         .86     -.48           .97
Confident          .73     -.64         .97     -.04           .95
Happy              .83     -.47         .94      .15           .91
Just               .71      .58         .20      .89           .84
Likeable           .89      .39         .45      .87           .95
Kind               .58      .75        -.02      .95           .89
Sincere            .52      .82        -.11      .97           .95
Dependable         .93      .27         .56      .79           .94
Religious          .55      .79        -.07      .96           .92
Good looking       .91     -.29         .89      .35           .91
Sociable           .91     -.22         .84      .40           .87
Strong             .91     -.21         .84      .41           .86

Variance
accounted for     7.91     5.85        7.11     6.65          13.76
Proportion of
total variance     .53      .39         .47      .44            .92
Cumulative
proportion         .53      .92         .47      .92            .92


Figure 13.6. Factor scores of adjective rating of voices with five levels of manipulated rate.

The factor scores were of primary interest in this study. The goal was to ascertain the effect of the rate manipulations on the two factors, that is, to determine the perceived change in competence and benevolence when the speaking rate is increased or decreased.

The  two  factor  scores  were  obtained  for  each  of  the  20  voices;  these  are  plotted 
in  Figure  13.6,  where  a  consistent  effect  of  the  manipulation  of  speaking  rate  on 
all  four  voices  can  clearly  be  seen.  Decreasing  the  speaking  rate  causes  the  speaker 
to  be  rated  less  competent;  increasing  the  rate  causes  the  speaker  to  be  rated  less 
benevolent.  The  mean  vectors  (centroids)  are  also  given  in  Figure  13.6  for  the  four 
speakers.  □ 

13.7  VALIDITY  OF  THE  FACTOR  ANALYSIS  MODEL 

For many statisticians, factor analysis is controversial and does not belong in a toolkit of legitimate multivariate techniques. The reasons for this mistrust include the following: the difficulty in choosing m, the many methods of extracting factors, the many rotation techniques, and the subjectivity in interpretation. Some statisticians also criticize factor analysis because of the indeterminacy of the factor loading matrix $\Lambda$ or $\hat{\Lambda}$, first noted in Section 13.2.2. However, it is the ability to rotate that gives factor analysis its utility, if not its charm.

The basic question is whether the factors really exist. The model (13.11) for the covariance matrix is $\Sigma = \Lambda\Lambda' + \Psi$ or $\Sigma - \Psi = \Lambda\Lambda'$, where $\Lambda\Lambda'$ is of rank m. Many populations have covariance matrices that do not approach this pattern unless m is large. Thus the model will not fit data from such a population when we try to impose a small value of m. On the other hand, for a population in which $\Sigma$ is reasonably close to $\Lambda\Lambda' + \Psi$ for small m, the sampling procedure leading to S may obscure this pattern. The researcher may believe there are underlying factors but has difficulty collecting data that will reveal them. In many cases, the basic problem is that S (or R) contains both structure and error, and the methods of factor analysis cannot separate the two.

A  statistical  consultant  in  a  university  setting  or  elsewhere  all  too  often  sees  the 
following  scenario.  A  researcher  designs  a  long  questionnaire,  with  answers  to  be 
given in, say, a five-point semantic differential scale or Likert scale. The respondents, who vary in attitude from uninterested to resentful, hurriedly mark answers
that  in  many  cases  are  not  even  good  subjective  responses  to  the  questions.  Then  the 
researcher  submits  the  results  to  a  handy  factor  analysis  program.  Being  disappointed 
in  the  results,  he  or  she  appeals  to  a  statistician  for  help.  They  attempt  to  improve  the 
results  by  trying  different  methods  of  extraction,  different  rotations,  different  values 
of  m,  and  so  on.  But  it  is  all  to  no  avail.  The  scree  plot  looks  more  like  the  foothills 
than  a  steep  cliff  with  gently  sloping  debris  at  the  bottom.  There  is  no  clear  value 
of  m.  They  have  to  extract  10  or  12  factors  to  account  for,  say,  60%  of  the  variance, 
and  interpretation  of  this  large  number  of  factors  is  hopeless.  If  a  few  underlying 
dimensions  exist,  they  are  totally  obscured  by  both  systematic  and  random  errors  in 
marking  the  questionnaire.  A  factor  analysis  model  simply  does  not  fit  such  a  data 
set,  unless  a  large  value  of  m  is  used,  which  gives  useless  results. 

It  is  not  necessarily  the  “discreteness”  of  the  data  that  causes  the  problem,  but 
the  “noisiness”  of  the  data.  The  specified  variables  are  not  measured  accurately.  In 
some  cases,  discrete  variables  yield  satisfactory  results,  such  as  in  Examples  13.3.1, 
13.3.2,  13.5.2a,  and  13.5.2b(a),  where  a  12-year-old  girl,  responding  carefully  to  a 
semantic  differential  scale,  produced  data  leading  to  an  unambiguous  factor  analysis. 
On the other hand, continuous variables do not guarantee good results [see Example 13.7(a)].

In cases in which some factors are found that provide a satisfactory fit to the data, we should still be tentative in interpretation until we can independently establish the existence of the factors. If the same factors emerge in repeated sampling from the same population or a similar one, then we can have confidence that application of the model has uncovered some real factors. Thus it is good practice to repeat the experiment to check the stability of the factors. If the data set is large enough, it could be split in half and a factor analysis performed on each half. The two solutions could be compared with each other and with the solution for the complete set.

If there is replication in the data set, it may be helpful to average over the replications. This was done to great advantage in Example 13.6, where several judges rated the same voices. Averaging over the judges produced variables that apparently possessed very low noise. Similar experimentation with different judges always produced the same factors. Unfortunately, replication of this type is unavailable in most situations.

As  with  other  techniques  in  this  book,  factor  analysis  assumes  that  the  variables 
are  at  least  approximately  linearly  related  to  each  other.  We  could  make  bivariate 
scatter  plots  to  check  this  assumption. 

A basic prerequisite for a factor analysis application is that the variables not be independent. To check this requirement, we could test $H_0\colon P_\rho = I$ by using the test in Section 7.4.3.

Some writers have suggested that $R^{-1}$ should be a near-diagonal matrix in order to successfully fit a factor analysis model. To assess how close $R^{-1}$ is to a diagonal matrix, Kaiser (1970) proposed a measure of sampling adequacy,

$$\text{MSA} = \frac{\sum_{i \neq j} r_{ij}^2}{\sum_{i \neq j} r_{ij}^2 + \sum_{i \neq j} q_{ij}^2}, \qquad (13.60)$$

where $r_{ij}^2$ is the square of an element from R and $q_{ij}^2$ is the square of an element from $Q = DR^{-1}D$, with $D = [(\text{diag}\, R^{-1})^{1/2}]^{-1}$. As $R^{-1}$ approaches a diagonal matrix, MSA approaches 1. Kaiser and Rice (1974) suggest that MSA should exceed .8 for satisfactory results to be expected. We show some results for MSA in Example 13.7(b).
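As a rough illustration of (13.60), the following Python sketch computes MSA directly from a correlation matrix using NumPy. The function name msa and the equicorrelated matrix R_demo are our own assumptions, introduced only to exercise the formula; they are not data from the examples in this chapter. The sketch also assumes that R is nonsingular.

```python
import numpy as np

def msa(R):
    """Kaiser's measure of sampling adequacy (13.60) for a correlation matrix R."""
    R = np.asarray(R, dtype=float)
    R_inv = np.linalg.inv(R)                      # raises LinAlgError if R is singular
    D = np.diag(1.0 / np.sqrt(np.diag(R_inv)))    # D = [(diag R^-1)^(1/2)]^-1
    Q = D @ R_inv @ D
    off = ~np.eye(R.shape[0], dtype=bool)         # mask selecting elements with i != j
    r2 = np.sum(R[off] ** 2)
    q2 = np.sum(Q[off] ** 2)
    return r2 / (r2 + q2)

# Hypothetical equicorrelated matrix (all correlations .6), for illustration only.
R_demo = np.full((4, 4), 0.6)
np.fill_diagonal(R_demo, 1.0)
print(round(msa(R_demo), 3))
```

For any 2 x 2 correlation matrix the two sums in (13.60) coincide, so the function returns .5 whatever the correlation; this gives a simple sanity check on an implementation.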

In  summary,  there  are  many  data  sets  to  which  factor  analysis  should  not  be 
applied.  One  indication  that  R  is  inappropriate  for  factoring  is  the  failure  of  the 
methods  in  Section  13.4  to  clearly  and  rather  objectively  choose  a  value  for  m.  If 
the  scree  plot  does  not  have  a  pronounced  bend  or  the  eigenvalues  do  not  show  a 
large  gap  around  1,  then  R  is  likely  to  be  unsuitable  for  factoring.  In  addition,  the 
communality  estimates  after  factoring  should  be  fairly  large. 

To balance the “good” examples in this chapter, we now give an example involving a data set that cannot be successfully modeled by factor analysis. Likewise, the
problems  at  the  end  of  the  chapter  include  both  “good”  and  “bad”  data  sets. 

Example 13.7(a). As an illustration of an application of factor analysis that is less successful than previous examples in this chapter, we consider the diabetes data of Table 3.6. The correlation matrix for the five variables is as follows:

$$\begin{pmatrix}
1.00 & .05 & -.13 & .07 & .21 \\
.05 & 1.00 & -.01 & .01 & -.10 \\
-.13 & -.01 & 1.00 & .29 & .05 \\
.07 & .01 & .29 & 1.00 & .21 \\
.21 & -.10 & .05 & .21 & 1.00
\end{pmatrix}$$

The  correlations  are  all  small  and  the  variables  do  not  appear  to  have  much  in 
common.  The  MSA  value  is  .49.  The  eigenvalues  are  1.40,  1.21,  1.04,  .71,  and  .65. 
Three  factors  would  be  required  to  account  for  73%  of  the  variance  and  four  factors 
to  reach  87%.  This  is  not  a  useful  reduction  in  dimensionality.  The  eigenvalues  are 
plotted in a scree graph in Figure 13.7. The lack of a clear value of m is apparent.

Figure 13.7. Scree graph for diabetes data (horizontal axis: eigenvalue number).
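The percentages quoted in this example follow from the eigenvalues by dividing running sums by tr(R) = p. The short Python sketch below reproduces that arithmetic from the eigenvalues as printed in the text; the variable names are our own, and because the printed eigenvalues are rounded, the last cumulative proportion need not equal exactly 1.

```python
# Eigenvalues of R for the diabetes data, as quoted in the text (rounded).
eigenvalues = [1.40, 1.21, 1.04, 0.71, 0.65]
p = 5  # number of variables; tr(R) = p for a correlation matrix

cumulative = []
running = 0.0
for value in eigenvalues:
    running += value
    cumulative.append(running / p)   # proportion of tr(R) explained by the first m

for m, share in enumerate(cumulative, start=1):
    print(f"m = {m}: cumulative proportion = {share:.2f}")
```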


It is evident from the small correlations in R that the communalities of the variables will not be large. The principal component method, which essentially estimates the initial communalities as 1, gave very different final communality estimates than did the iterated principal factor method:

                                      Communalities
Principal component method            .71   .91   .71   .67   .64
Iterated principal factor method      .31   .16   .35   .37   .33

The communalities obtained by the iterated approach reflect more accurately the small correlations among the variables.
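Since a communality is the sum of squared loadings across the factors for a given variable, the iterated principal factor communalities can be checked against the rotated loadings reported in Table 13.11 of this example. The following Python sketch does so with NumPy; the array name loadings is our own, and because the loadings are printed to two decimals, column sums recomputed this way may differ from the printed ones in the last digit.

```python
import numpy as np

# Varimax rotated loadings from Table 13.11 (rows: variables 1-5; columns: f1-f3).
loadings = np.array([
    [-0.08,  0.54,  0.12],
    [ 0.01,  0.01,  0.40],
    [ 0.57, -0.15, -0.03],
    [ 0.57,  0.22,  0.02],
    [ 0.19,  0.47, -0.27],
])

communalities = np.sum(loadings ** 2, axis=1)        # row sums of squared loadings
variance_by_factor = np.sum(loadings ** 2, axis=0)   # column sums of squared loadings

print(np.round(communalities, 2))
print(np.round(variance_by_factor, 2), round(float(communalities.sum()), 2))
```

An orthogonal rotation leaves the row sums of squares unchanged, so the same communalities would be obtained from the unrotated loadings.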

The varimax rotated factor loadings for three factors extracted by the iterated principal factor method are given in Table 13.11.

Table 13.11. Varimax Rotated Factor Loadings for Iterated Principal Factors from the Diabetes Data

                             Rotated Loadings
Variable                   f1       f2       f3      Communalities
1                         -.08      .54      .12         .31
2                          .01      .01      .40         .16
3                          .57     -.15     -.03         .35
4                          .57      .22      .02         .37
5                          .19      .47     -.27         .33
Variance accounted for     .69      .59      .24        1.52

The first factor is associated with variables 3 and 4, the second factor with variables 1 and 5, and the third with variable 2. This clustering of variables can be seen in R, where variables 1 and 5 have a correlation of .21, variables 3 and 4 have a correlation of .29, and variable 2 has very low correlations with all other variables. However, these correlations (.21 and .29) are small, and in this case the collapsing of five variables to three factors is not a useful reduction in dimensionality, especially since the first three eigenvalues account for only 73% of tr(R). The 73% is not convincingly greater than 60%, which we would expect from three original variables picked at random. This conclusion is borne out by a test of $H_0\colon P_\rho = I$. Using (7.37) and (7.38), we obtain

$$u = |R| = .80276, \qquad \nu = 20 - 1 = 19, \qquad p = 5,$$

$$u' = -\left[\nu - \tfrac{1}{6}(2p + 5)\right] \ln u = -\left(19 - \tfrac{15}{6}\right)(-.2197) = 3.625.$$

With $\tfrac{1}{2}p(p - 1) = 10$ degrees of freedom, the .05 critical value for this approximate $\chi^2$-test is 18.31, and we have no basis to question the independence of the five variables. Thus the three factors we obtained are very likely an artifact of the present sample and would not reappear in another sample from the same population. □
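The arithmetic of the independence test used in Example 13.7(a) can be sketched in a few lines of Python. The function name independence_statistic is our own; the inputs (|R| = .80276, 19 degrees of freedom for the sample, p = 5) and the critical value 18.31 are taken from the example rather than computed from the raw data.

```python
import math

def independence_statistic(det_R, nu, p):
    """Return u' = -[nu - (2p + 5)/6] ln|R| and its degrees of freedom p(p - 1)/2."""
    stat = -(nu - (2 * p + 5) / 6) * math.log(det_R)
    df = p * (p - 1) // 2
    return stat, df

stat, df = independence_statistic(det_R=0.80276, nu=19, p=5)
critical_value = 18.31   # .05 critical value for 10 df, as quoted in the example
print(round(stat, 3), df, stat > critical_value)
```

A statistic well below the critical value, as here, gives no grounds for rejecting independence, which is the reading adopted in the example.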

Example  13.7(b).  For  data  sets  used  in  previous  examples  in  this  chapter,  the  values 
of  MSA  from  (13.60)  are  calculated  as  follows: 


Seishu data:      MSA = .53,
Sons data:        MSA = .82,
Voice data:       MSA = .73,
Diabetes data:    MSA = .49.

The  MSA  value  cannot  be  computed  for  the  perception  data,  because  R  is  singular. 

These  results  do  not  suggest  great  confidence  in  the  MSA  index  as  a  sole  guide 
to  the  suitability  of  R  for  factoring.  We  see  a  wide  disparity  in  the  MSA  values  for 
the  first  three  data  sets.  Yet  all  three  yielded  successful  factor  analyses.  These  three 
MSA  values  seem  to  be  inversely  related  to  the  number  of  factors:  In  the  sons  data, 
there  were  indications  that  one  factor  would  suffice;  the  voice  data  clearly  had  two 
factors;  and  for  the  Seishu  data,  there  were  four  factors. 

The MSA for the diabetes data is close to that of the Seishu data. Yet the diabetes data are totally unsuitable for factor analysis, whereas the factor analysis of the Seishu data is very convincing. □


13.8  THE  RELATIONSHIP  OF  FACTOR  ANALYSIS  TO 
PRINCIPAL  COMPONENT  ANALYSIS 

Both factor analysis and principal component analysis have the goal of reducing dimensionality. Because the objectives are similar, many authors discuss principal component analysis as another type of factor analysis. This can be confusing, and we wish to underscore the distinguishing characteristics of the two techniques.

Two of the differences between factor analysis and principal component analysis were mentioned in Section 13.1: (1) In factor analysis, the variables are expressed as linear combinations of the factors, whereas the principal components are linear functions of the variables, and (2) in principal component analysis, the emphasis is on explaining the total variance $\sum_i s_{ii}$, as contrasted with the attempt to explain the covariances in factor analysis.

Additional differences are that (3) principal component analysis requires essentially no assumptions, whereas factor analysis makes several key assumptions; (4) the
principal  components  are  unique  (assuming  distinct  eigenvalues  of  S),  whereas  the 
factors  are  subject  to  an  arbitrary  rotation;  and  (5)  if  we  change  the  number  of  factors, 
the  (estimated)  factors  change.  This  does  not  happen  in  principal  components. 

The ability to rotate to improve interpretability is one of the advantages of factor analysis over principal components. If finding and describing some underlying factors is the goal, factor analysis may prove more useful than principal components; we would prefer factor analysis if the factor model fits the data well and we like the interpretation of the rotated factors. On the other hand, if we wish to define a smaller number of variables for input into another analysis, we would ordinarily prefer principal components, although this can sometimes be accomplished with factor scores. Occasionally, principal components are interpretable, as in the size and shape components in Example 12.8.1.


PROBLEMS 

13.1 Show that the assumptions lead to (13.2), $\text{var}(y_i) = \lambda_{i1}^2 + \lambda_{i2}^2 + \cdots + \lambda_{im}^2 + \psi_i$.

13.2 Verify directly that $\text{cov}(y, f) = \Lambda$ as in (13.13).

13.3 Show that $f^* = T'f$ in (13.18) satisfies the assumptions (13.6) and (13.7), $E(f^*) = 0$ and $\text{cov}(f^*) = I$.

13.4 Show that $\sum_{ij} e_{ij}^2 \le \theta_{m+1}^2 + \theta_{m+2}^2 + \cdots + \theta_p^2$ as in (13.34), where the $e_{ij}$'s are the elements of $E = S - (\hat{\Lambda}\hat{\Lambda}' + \hat{\Psi})$ and the $\theta_i$'s are eigenvalues of S.

13.5 Show that $\sum_{i=1}^{p} \sum_{j=1}^{m} \hat{\lambda}_{ij}^2$ is equal to the sum of the first m eigenvalues and also equal to the sum of all p communalities, as in (13.46).

13.6  In  Example  13.3.2,  the  correlation  matrix  for  the  perception  data  was  shown 
to  have  an  eigenvalue  equal  to  0.  Find  the  multicollinearity  among  the  five 
variables  that  this  implies. 

13.7  Use  the  words  data  of  Table  5.9. 

(a)  Obtain  principal  component  loadings  for  two  factors. 

(b)  Do  a  graphical  rotation  of  the  two  factors. 

(c) Do a varimax rotation and compare the results with those in part (b).

13.8 Use the ramus bone data of Table 3.6.

(a)  Extract  loadings  by  the  principal  component  method  and  do  a  varimax 
rotation.  Use  two  factors. 

(b) Do all variables have a complexity of 1? Carry out an oblique rotation to improve the loadings.

(c)  What  is  the  angle  between  the  oblique  axes?  Would  a  single  factor  (m  = 
1)  be  more  appropriate  here? 

13.9 Carry out a factor analysis of the rootstock data of Table 6.2. Combine the six groups into a single sample.

(a)  Estimate  the  loadings  for  two  factors  by  the  principal  component  method 
and  do  a  varimax  rotation. 

(b)  Did  the  rotation  improve  the  loadings? 

13.10 Use the fish data of Table 6.17. Combine the three groups into a single sample.

(a)  Obtain  loadings  on  two  factors  by  the  principal  component  method  and 
do  a  varimax  rotation. 

(b)  Notice  the  similarity  of  loadings  for  yj  and  yi.  Is  there  any  indication  in 
the  correlation  matrix  as  to  why  this  is  so? 

(c)  Compute  factor  scores. 

(d)  Using  the  factor  scores,  carry  out  a  MANOVA  comparing  the  three 
groups. 

13.11 Carry out a factor analysis of the flea data in Table 5.5. Combine the two groups into a single sample.

(a) From an examination of the eigenvalues greater than 1, the scree plot, and the percentages, is there a clear choice of m?

(b)  Extract  two  factors  by  the  principal  component  method  and  carry  out  a 
varimax  rotation. 

(c)  Is  the  rotation  an  improvement?  Try  an  oblique  rotation. 

13.12 Use the engineer data of Table 5.6. Combine the two groups into a single sample.

(a) Using a scree plot, the number of eigenvalues greater than 1, and the percentages, is there a clear choice of m?

(b)  Extract  three  factors  by  the  principal  component  method  and  carry  out  a 
varimax  rotation. 

(c) Extract three factors by the principal factor method and carry out a varimax rotation.

(d) Compare the results of parts (b) and (c).

13.13 Use the probe word data of Table 3.5.

(a)  Obtain  loadings  for  two  factors  by  the  principal  component  method  and 
carry  out  a  varimax  rotation. 

(b)  Notice  the  near  duplication  of  loadings  for  yi  and  yy.  Is  there  any  indica¬ 
tion  in  the  correlation  matrix  as  to  why  this  is  so? 

(c)  Is  the  rotation  satisfactory?  Try  an  oblique  rotation. 


CHAPTER  14 


Cluster  Analysis 


14.1  INTRODUCTION 

In  cluster  analysis  we  search  for  patterns  in  a  data  set  by  grouping  the  (multivariate) 
observations  into  clusters.  The  goal  is  to  find  an  optimal  grouping  for  which  the 
observations  or  objects  within  each  cluster  are  similar,  but  the  clusters  are  dissimilar 
to  each  other.  We  hope  to  find  the  natural  groupings  in  the  data,  groupings  that  make 
sense  to  the  researcher. 

Cluster  analysis  differs  fundamentally  from  classification  analysis  (Chapter  9).  In 
classification analysis, we allocate the observations to a known number of predefined groups or populations. In cluster analysis, neither the number of groups nor the
groups  themselves  are  known  in  advance. 

To group the observations into clusters, many techniques begin with similarities between all pairs of observations. In many cases the similarities are based on some measure of distance. Other cluster methods use a preliminary choice for cluster centers or a comparison of within- and between-cluster variability. It is also possible to cluster the variables, in which case the similarity could be a correlation; see Section 14.7.

We can search for clusters graphically by plotting the observations. If there are only two variables (p = 2), we can do this in a scatter plot (see Section 3.3). For p > 2, we can plot the data in two dimensions using principal components (see Section 12.4) or biplots (see Section 15.3). For an example of a principal component plot, see Figure 12.7 in Section 12.4, in which four clear groupings of points can be observed. Another approach to plotting is provided by projection pursuit, which seeks two-dimensional projections that reveal clusters [see Friedman and Tukey (1974); Huber (1985); Sibson (1984); Jones and Sibson (1987); Yenyukov (1988); Posse (1990); Nason (1995); Ripley (1996, pp. 296-303)].

Cluster  analysis  has  also  been  referred  to  as  classification,  pattern  recognition 
(specifically,  unsupervised  learning),  and  numerical  taxonomy.  The  techniques 
of  cluster  analysis  have  been  extensively  applied  to  data  in  many  fields,  such  as 
medicine,  psychiatry,  sociology,  criminology,  anthropology,  archaeology,  geology, 
geography, remote sensing, market research, economics, and engineering.

We shall concentrate largely on quantitative variables [for categorical variables, see Gordon (1999) or Everitt (1993)]. The data matrix [see (3.17)] can be written as

$$Y = \begin{pmatrix} y_1' \\ y_2' \\ \vdots \\ y_n' \end{pmatrix} = (y_{(1)}, y_{(2)}, \ldots, y_{(p)}), \qquad (14.1)$$

where $y_i'$ is a row (observation vector) and $y_{(j)}$ is a column (corresponding to a variable). We generally wish to group the n $y_i'$s (rows) into g clusters. We may also wish to cluster the columns $y_{(j)}$, $j = 1, 2, \ldots, p$ (see Section 14.7).

Two common approaches to clustering the observation vectors are hierarchical clustering and partitioning. In hierarchical clustering we typically start with n clusters, one for each observation, and end with a single cluster containing all n observations. At each step, an observation or a cluster of observations is absorbed into another cluster. We can also reverse this process, that is, start with a single cluster containing all n observations and end with n clusters of a single item each (see Section 14.3.10). In partitioning, we simply divide the observations into g clusters. This can be done by starting with an initial partitioning or with cluster centers and then reallocating the observations according to some optimality criterion. Other clustering methods that we will discuss are based on fitting mixtures of multivariate normal distributions or searching for regions of high density, sometimes called modes.

There is an abundant literature on cluster analysis. Useful monographs and reviews have been given by Gordon (1999), Everitt (1993), Khattree and Naik (2000, Chapter 6), Kaufman and Rousseeuw (1990), Seber (1984, Chapter 7), Anderberg (1973), and Hartigan (1975a).


14.2  MEASURES  OF  SIMILARITY  OR  DISSIMILARITY 

Since  cluster  analysis  attempts  to  identify  the  observation  vectors  that  are  similar 
and  group  them  into  clusters,  many  techniques  use  an  index  of  similarity  or  proximity 
between  each  pair  of  observations.  A  convenient  measure  of  proximity  is  the  distance 
between  two  observations.  Since  a  distance  increases  as  two  units  become  further 
apart,  distance  is  actually  a  measure  of  dissimilarity. 

A common distance function is the Euclidean distance between two vectors $x = (x_1, x_2, \ldots, x_p)'$ and $y = (y_1, y_2, \ldots, y_p)'$, defined as

$$d(x, y) = \sqrt{(x - y)'(x - y)} = \sqrt{\sum_{j=1}^{p} (x_j - y_j)^2}. \qquad (14.2)$$

To adjust for differing variances and covariances among the p variables, we could use the statistical distance

$$d(x, y) = \sqrt{(x - y)' S^{-1} (x - y)} \qquad (14.3)$$

[see (3.79)], where S is the sample covariance matrix. After the clusters are formed, S could be computed as the pooled within-cluster covariance matrix, but we do not know beforehand what the clusters will be. If we compute S on the unpartitioned sample, there will be distortion of the variances and covariances because of the groups in the data (assuming there really are some natural clusters). We therefore usually use the Euclidean distance given by (14.2). In some clustering procedures, it is not necessary to take the square root in (14.2) or (14.3).
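To make the contrast between (14.2) and (14.3) concrete, the following Python sketch evaluates both distances with NumPy. The function names and the diagonal covariance matrix S are hypothetical choices made for illustration; they are not taken from any data set in this book.

```python
import numpy as np

def euclidean_distance(x, y):
    """Euclidean distance (14.2) between two observation vectors."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(diff @ diff))

def statistical_distance(x, y, S):
    """Statistical distance (14.3), given a nonsingular covariance matrix S."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    # Solve S z = diff rather than forming the inverse of S explicitly.
    return float(np.sqrt(diff @ np.linalg.solve(S, diff)))

# Hypothetical covariance matrix: the first variable has four times the variance.
S = np.array([[4.0, 0.0],
              [0.0, 1.0]])
x, y = [2.0, 5.0], [4.0, 2.0]
print(euclidean_distance(x, y), statistical_distance(x, y, S))
```

With S equal to the identity matrix the two functions agree; with the S shown, the difference in the high-variance first variable is downweighted.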

Other distance measures have been suggested, for example, the Minkowski metric

$$d(x, y) = \left[ \sum_{j=1}^{p} |x_j - y_j|^r \right]^{1/r}. \qquad (14.4)$$

For r = 2, d(x, y) in (14.4) becomes the Euclidean distance given in (14.2). For p = 2 and r = 1, (14.4) measures the “city block” distance between two observations. There are distance measures for categorical data; see Gordon (1999, Chapter 2).
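The two special cases of (14.4) mentioned above can be checked numerically. The Python sketch below is a minimal implementation under the assumption r >= 1; the function name minkowski_distance and the two sample points are our own choices.

```python
def minkowski_distance(x, y, r):
    """Minkowski metric (14.4): [sum_j |x_j - y_j|^r]^(1/r), assuming r >= 1."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)

x, y = (2, 5), (4, 2)
city_block = minkowski_distance(x, y, r=1)   # |2 - 4| + |5 - 2|
euclidean = minkowski_distance(x, y, r=2)    # sqrt((2 - 4)^2 + (5 - 2)^2)
print(city_block, round(euclidean, 2))
```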

For the n observation vectors $y_1, y_2, \ldots, y_n$, we can compute an n × n matrix $D = (d_{ij})$ of distances (or dissimilarities), where $d_{ij} = d(y_i, y_j)$ is usually given by (14.2), $d(y_i, y_j) = \sqrt{(y_i - y_j)'(y_i - y_j)}$. We sometimes use $D = (d_{ij}^2)$, where $d_{ij}^2 = d^2(y_i, y_j) = (y_i - y_j)'(y_i - y_j)$ is the square of (14.2). The matrix D typically is symmetric with diagonal elements equal to zero.

The scale of measurement of the variables is an important consideration when using the Euclidean distance measure in (14.2). Changing the scale can affect the relative distances among the items. For example, suppose three items have the following bivariate measurements $(y_1, y_2)$: (2, 5), (4, 2), (7, 9). Using $d_{ij}$ as given by (14.2), the matrix $D = (d_{ij})$ for these items is

$$D_1 = \begin{pmatrix} 0.0 & 3.6 & 6.4 \\ 3.6 & 0.0 & 7.6 \\ 6.4 & 7.6 & 0.0 \end{pmatrix}.$$

However, if we multiply $y_1$ by 100 as, for example, in changing from meters to centimeters, the matrix becomes

$$D_2 = \begin{pmatrix} 0 & 200 & 500 \\ 200 & 0 & 300 \\ 500 & 300 & 0 \end{pmatrix},$$

and the largest distance is now $d_{13}$ instead of $d_{23}$. The distance rankings have been altered by scaling.
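The two distance matrices in this illustration can be reproduced with a few lines of Python. The helper distance_matrix is our own name for a routine that applies (14.2) to every pair of rows; the data are the three bivariate items given above.

```python
import numpy as np

def distance_matrix(Y):
    """n x n matrix of Euclidean distances (14.2) between the rows of Y."""
    Y = np.asarray(Y, dtype=float)
    diff = Y[:, None, :] - Y[None, :, :]     # all pairwise row differences
    return np.sqrt(np.sum(diff ** 2, axis=2))

Y = np.array([[2, 5], [4, 2], [7, 9]], dtype=float)
D1 = distance_matrix(Y)

Y_scaled = Y.copy()
Y_scaled[:, 0] *= 100                        # e.g., meters to centimeters for y1
D2 = distance_matrix(Y_scaled)

print(np.round(D1, 1))
print(np.round(D2, 0))
```

After rescaling, the first variable dominates every distance, which is why the ordering of the pairs changes.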

To counter this problem, each variable could be standardized in the usual way by subtracting the mean and dividing by the standard deviation of the variable. However, such scaling would ordinarily be based on the entire data set, that is, on all n values in each column of Y in (14.1). In this case, the variables that best separate clusters might no longer do so after division by standard deviations that include between-cluster variation. If we use standardized variables, the clusters could be less well separated. The question of scaling is, therefore, not an easy one. However, standardization of this type is recommended by many authors.
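A brief Python sketch shows what column standardization does and does not accomplish. The helper names standardize and distance_matrix are our own, and the data are the three bivariate items used earlier in this section. The sketch demonstrates only that standardized distances no longer depend on the units of a variable; it does not address the loss of cluster separation discussed above.

```python
import numpy as np

def standardize(Y):
    """Center each column and divide by its sample standard deviation."""
    Y = np.asarray(Y, dtype=float)
    return (Y - Y.mean(axis=0)) / Y.std(axis=0, ddof=1)

def distance_matrix(Y):
    """n x n matrix of Euclidean distances between the rows of Y."""
    diff = Y[:, None, :] - Y[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=2))

Y = np.array([[2, 5], [4, 2], [7, 9]], dtype=float)
Y_scaled = Y.copy()
Y_scaled[:, 0] *= 100                        # change of units for the first variable

D_std = distance_matrix(standardize(Y))
D_std_scaled = distance_matrix(standardize(Y_scaled))
print(np.round(D_std, 2))
print(np.allclose(D_std, D_std_scaled))
```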

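By (14.2), the squared Euclidean distance between two observations $x = (x_1, x_2, \ldots, x_p)'$ and $y = (y_1, y_2, \ldots, y_p)'$ is $d^2(x, y) = \sum_{j=1}^{p} (x_j - y_j)^2$. This can be expressed as

$$d^2(x, y) = (v_x - v_y)^2 + p(\bar{x} - \bar{y})^2 + 2 v_x v_y (1 - r_{xy}), \qquad (14.5)$$

where $v_x^2 = \sum_{j=1}^{p} (x_j - \bar{x})^2$ and $\bar{x} = \sum_{j=1}^{p} x_j / p$, with similar expressions for $v_y^2$ and $\bar{y}$. The correlation $r_{xy}$ in (14.5) is given by

$$r_{xy} = \frac{\sum_{j=1}^{p} (x_j - \bar{x})(y_j - \bar{y})}{\sqrt{\sum_{j=1}^{p} (x_j - \bar{x})^2 \sum_{j=1}^{p} (y_j - \bar{y})^2}}. \qquad (14.6)$$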


In Figure 14.1, we illustrate the profile (see Sections 5.9 and 6.8) for each of two observation vectors x and y. The squared Euclidean distance in (14.5) can be used to compare the profiles of x and y in terms of levels, variation, and shape, where $\bar{x}$ and $\bar{y}$ are the levels of the two profiles, $v_x$ and $v_y$ are the variations of the profiles, and the correlation $r_{xy}$ is a measure of the closeness of the shapes of the two profiles. The closer $r_{xy}$ is to 1, the greater is the similarity in shape of the two profiles. Note that $\bar{x}$ and $v_x$ are the mean and variation of the p variables within the observation vector x, not over the n observations in the data set. A similar comment can be made about $\bar{y}$ and $v_y$. Likewise, the correlation $r_{xy}$ is between the two observation vectors x and y, not between two variables. The use of $r_{xy}$ has been questioned by Jardine and Sibson (1971) and Wishart (1971), but Strauss, Bartko, and Carpenter (1973) found the correlation to be superior to the Euclidean distance for finding the clusters in a particular data set.

Figure 14.1. Profiles for two observation vectors x and y.
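The decomposition of the squared Euclidean distance into variation, level, and shape terms in (14.5) can be verified numerically. In the Python sketch below, the function name profile_decomposition and the two profiles x and y are hypothetical, chosen only so that all three terms are nonzero; the sketch assumes that neither profile is constant, so that the correlation is defined.

```python
import numpy as np

def profile_decomposition(x, y):
    """Return the variation, level, and shape terms of (14.5) for vectors x and y."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    p = x.size
    vx = np.sqrt(np.sum((x - x.mean()) ** 2))
    vy = np.sqrt(np.sum((y - y.mean()) ** 2))
    r_xy = np.sum((x - x.mean()) * (y - y.mean())) / (vx * vy)   # (14.6)
    variation = (vx - vy) ** 2
    level = p * (x.mean() - y.mean()) ** 2
    shape = 2 * vx * vy * (1 - r_xy)
    return variation, level, shape

# Hypothetical profiles with p = 4 variables each.
x = [3.0, 5.0, 4.0, 8.0]
y = [1.0, 2.0, 2.0, 3.0]
terms = profile_decomposition(x, y)
squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
print(terms, sum(terms), squared_distance)
```

For these two profiles most of the squared distance comes from the difference in levels, with only a small contribution from shape, since the profiles are highly correlated.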


14.3  HIERARCHICAL  CLUSTERING 
14.3.1  Introduction 

Hierarchical methods and other clustering algorithms represent an attempt to find “good” clusters in the data using a computationally efficient technique. It is not generally feasible to examine all possible clusterings of a data set, especially a large one. The number of ways of partitioning a set of n items into g clusters is given by

$$N(n, g) = \frac{1}{g!} \sum_{k=1}^{g} \binom{g}{k} (-1)^{g-k} k^n \qquad (14.7)$$
[see Duran and Odell (1974, Chapter 4), Jensen (1969), and Seber (1984, p. 379)]. This can be approximated by $g^n/g!$, which is large even for moderate values of n and g. For example, $N(25, 10) \approx 2.8 \times 10^{18}$ by this approximation. The total number of possible partitions of a set of n items is $\sum_{g=1}^{n} N(n, g)$, which, for n = 25, is greater than $10^{19}$ under the same approximation. Hence, hierarchical methods and other approaches permit us to search for a reasonable solution without having to look at all possible arrangements.
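The counts in (14.7) are easy to evaluate exactly with integer arithmetic, which allows the exact values to be set beside the approximation $g^n/g!$. The Python sketch below does this for n = 25; the function names are our own. Running it suggests that the exact counts are somewhat smaller than the approximate ones, so figures obtained from the approximation are best read as orders of magnitude.

```python
from math import comb, factorial

def n_partitions(n, g):
    """Exact N(n, g) from (14.7), computed with integer arithmetic."""
    total = sum(comb(g, k) * (-1) ** (g - k) * k ** n for k in range(1, g + 1))
    return total // factorial(g)      # the sum is always divisible by g!

def n_partitions_approx(n, g):
    """The approximation g^n / g!."""
    return g ** n / factorial(g)

n = 25
exact_10 = n_partitions(n, 10)
approx_10 = n_partitions_approx(n, 10)
exact_total = sum(n_partitions(n, g) for g in range(1, n + 1))
approx_total = sum(n_partitions_approx(n, g) for g in range(1, n + 1))
print(f"{exact_10:.3e} {approx_10:.3e} {exact_total:.3e} {approx_total:.3e}")
```

Either way, the number of possible partitions is far too large for exhaustive search, which is the point of the argument.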

As noted in Section 14.1, hierarchical clustering algorithms involve a sequential process. In each step of the agglomerative hierarchical approach, an observation or a cluster of observations is merged into another cluster. In this process, the number of clusters shrinks and the clusters themselves grow larger. We start with n clusters (individual items) and end with one single cluster containing the entire data set. An alternative approach, called the divisive method, starts with a single cluster containing all n items and partitions a cluster into two clusters at each step (see Section 14.3.10). The end result of the divisive approach is n clusters of one item each. Agglomerative methods are more commonly used than divisive methods. In either type of hierarchical clustering, a decision must be made as to the optimal number of clusters (see Section 14.5).

At  each  step  of  an  agglomerative  hierarchical  approach,  the  two  closest  clusters 
are  merged  into  a  single  new  cluster.  The  process  is  therefore  irreversible  in  the  sense 
that  any  two  items  that  are  once  lumped  together  in  a  cluster  cannot  be  separated  later 
in  the  procedure;  any  early  mistakes  cannot  be  corrected.  Similarly,  in  a  divisive 
hierarchical  method,  items  cannot  be  moved  to  other  clusters.  An  optional  approach 
is  to  carry  out  a  hierarchical  procedure  followed  by  a  partitioning  procedure  in  which 
items  can  be  moved  from  one  cluster  to  another  (see  Section  14.4.1). 

Since an agglomerative hierarchical procedure combines the two closest clusters
at each step, we must consider the question of measuring the similarity or
dissimilarity of two clusters. Different approaches to measuring distance between clusters
give rise to different hierarchical methods. Agglomerative techniques are discussed
in Sections 14.3.2-14.3.9, and the divisive approach is considered in Section 14.3.10.


14.3.2  Single  Linkage  (Nearest  Neighbor) 

In the single linkage method, the distance between two clusters A and B is defined
as the minimum distance between a point in A and a point in B:

$$D(A, B) = \min\{d(\mathbf{y}_i, \mathbf{y}_j),\ \text{for } \mathbf{y}_i \text{ in } A \text{ and } \mathbf{y}_j \text{ in } B\}, \tag{14.8}$$

where $d(\mathbf{y}_i, \mathbf{y}_j)$ is the Euclidean distance in (14.2) or some other distance between
the vectors $\mathbf{y}_i$ and $\mathbf{y}_j$. This approach is also called the nearest neighbor method.

At each step in the single linkage method, the distance (14.8) is found for every
pair of clusters, and the two clusters with smallest distance are merged. The number
of clusters is therefore reduced by 1. After two clusters are merged, the procedure is
repeated for the next step: the distances between all pairs of clusters are calculated
again, and the pair with minimum distance is merged into a single cluster.

The results of a hierarchical clustering procedure can be displayed graphically
using a tree diagram, also known as a dendrogram, which shows all the steps in the
hierarchical procedure, including the distances at which clusters are merged.
Dendrograms are shown in Figures 14.2 and 14.3 in Examples 14.3.2(a) and 14.3.2(b).

Example 14.3.2(a). Hartigan (1975a, p. 28) compared the crime rates per 100,000
population for various cities. The data are in Table 14.1 (taken from the 1970 U.S.
Statistical Abstract). In order to illustrate the use of the distance matrix in single
linkage clustering, we use the first six observations in Table 14.1 (Atlanta through
Detroit).


Table 14.1. City Crime Rates per 100,000 Population

City           Murder   Rape   Robbery   Assault   Burglary   Larceny   Auto Theft
Atlanta          16.5   24.8       106       147       1112       905          494
Boston            4.2   13.3       122        90        982       669          954
Chicago          11.6   24.7       340       242        808       609          645
Dallas           18.1   34.2       184       293       1668       901          602
Denver            6.9   41.5       173       191       1534      1368          780
Detroit          13.0   35.7       477       220       1566      1183          788
Hartford          2.5    8.8        68       103       1017       724          468
Honolulu          3.6   12.7        42        28       1457      1102          637
Houston          16.8   26.6       289       186       1509       787          697
Kansas City      10.8   43.2       255       226       1494       955          765
Los Angeles       9.7   51.8       286       355       1902      1386          862
New Orleans      10.3   39.7       266       283       1056      1036          776
New York          9.4   19.4       522       267       1674      1392          848
Portland          5.0   23.0       157       144       1530      1281          488
Tucson            5.1   22.9        85       148       1206       756          483
Washington       12.5   27.6       524       217       1496      1003          793

The  distance  matrix  D  is  given  by 


City       Distance between Cities
Atlanta        0    536.6    516.4    590.2    693.6    716.2
Boston     536.6        0    447.4    833.1    915.0    881.1
Chicago    516.4    447.4        0    924.0   1073.4    971.5
Dallas     590.2    833.1    924.0        0    527.7    464.5
Denver     693.6    915.0   1073.4    527.7        0    358.7
Detroit    716.2    881.1    971.5    464.5    358.7        0

The smallest distance is 358.7 between Denver and Detroit, and therefore these
two cities are joined at the first step to form $C_1$ = {Denver, Detroit}. In the next step,
the distance matrix is calculated for Atlanta, Boston, Chicago, Dallas, and $C_1$:


Atlanta        0    536.6    516.4    590.2    693.6
Boston     536.6        0    447.4    833.1    881.1
Chicago    516.4    447.4        0    924.0    971.5
Dallas     590.2    833.1    924.0        0    464.5
C1         693.6    881.1    971.5    464.5        0

Note that all elements of this distance matrix are contained in the original distance
matrix. This same pattern will hold in subsequent distance matrices and is also
characteristic of the complete linkage method [see Example 14.3.3(a)]. The smallest
distance is 447.4 between Boston and Chicago. Therefore $C_2$ = {Boston, Chicago}.
At the next step, the distance matrix is calculated for Atlanta, Dallas, $C_1$, and $C_2$:


Atlanta        0    516.4    590.2    693.6
C2         516.4        0    833.1    881.1
Dallas     590.2    833.1        0    464.5
C1         693.6    881.1    464.5        0

The smallest distance is 464.5 between Dallas and $C_1$, so that $C_3$ = {Dallas, $C_1$}.
The distance matrix for Atlanta, $C_2$, and $C_3$ is given by

Atlanta        0    516.4    590.2
C2         516.4        0    833.1
C3         590.2    833.1        0

The smallest distance is 516.4, which defines $C_4$ = {Atlanta, $C_2$}. The distance
matrix for $C_3$ and $C_4$ is

C3             0    590.2
C4         590.2        0
Figure 14.2. Dendrogram for single linkage of the first six observations in the city crime data
in Table 14.1 [see Example 14.3.2(a)]. (Horizontal axis: minimum distance between clusters,
from 600 at the left to 0 at the right. Cities, top to bottom: Detroit, Denver, Dallas,
Chicago, Boston, Atlanta.)


The last cluster is given by $C_5$ = {$C_3$, $C_4$}. The dendrogram for the steps in this
example is given in Figure 14.2. The order in which the clusters were formed and the
relative distances at which they formed can all be seen. Note that the distance scale
runs from right to left. □
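
The merging steps of Example 14.3.2(a) can be retraced in code. The sketch below is a plain-Python illustration, not the procedure used to produce Figure 14.2; the names `CITIES`, `DIST`, `cluster_distance`, and `single_linkage` are hypothetical, and the distances are copied from the matrix D above.

```python
from itertools import combinations

CITIES = ["Atlanta", "Boston", "Chicago", "Dallas", "Denver", "Detroit"]
DIST = [
    [0.0, 536.6, 516.4, 590.2, 693.6, 716.2],
    [536.6, 0.0, 447.4, 833.1, 915.0, 881.1],
    [516.4, 447.4, 0.0, 924.0, 1073.4, 971.5],
    [590.2, 833.1, 924.0, 0.0, 527.7, 464.5],
    [693.6, 915.0, 1073.4, 527.7, 0.0, 358.7],
    [716.2, 881.1, 971.5, 464.5, 358.7, 0.0],
]


def cluster_distance(dist, a, b):
    """Single linkage distance (14.8): the smallest item-to-item distance."""
    return min(dist[i][j] for i in a for j in b)


def single_linkage(dist, labels):
    """Return a list of (members of the new cluster, merge distance), one per step."""
    clusters = [frozenset([i]) for i in range(len(labels))]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of current clusters.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(dist, *pair),
        )
        merged = a | b
        merges.append(
            (sorted(labels[i] for i in merged), cluster_distance(dist, a, b))
        )
        clusters = [c for c in clusters if c != a and c != b] + [merged]
    return merges


MERGES = single_linkage(DIST, CITIES)
for members, height in MERGES:
    print(height, members)
```

Recomputing every pairwise cluster distance at each step keeps the sketch short; it is adequate for six items but would be wasteful for a large data set.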


Example  14.3.2(b).  To  further  illustrate  the  single  linkage  method  of  clustering,  we 
use  the  complete  city  crime  data  from  Table  14.1.  The  dendrogram  in  Figure  14.3 
shows  the  cluster  groupings  attained  by  the  single  linkage  method.  □ 
Figure 14.3. Dendrogram for single linkage of the complete city crime data from Table 14.1
[see Example 14.3.2(b)]. (Horizontal axis: minimum distance between clusters. Cities, top
to bottom: Boston, Portland, Honolulu, Denver, Los Angeles, New York, Washington, Detroit,
Kansas City, Houston, Dallas, Chicago, New Orleans, Hartford, Tucson, Atlanta.)


14.3.3  Complete  Linkage  (Farthest  Neighbor) 

In the complete linkage approach, also called the farthest neighbor method, the
distance between two clusters A and B is defined as the maximum distance between a
point in A and a point in B:

$$D(A, B) = \max\{d(\mathbf{y}_i, \mathbf{y}_j),\ \text{for } \mathbf{y}_i \text{ in } A \text{ and } \mathbf{y}_j \text{ in } B\}. \tag{14.9}$$


At  each  step,  the  distance  (14.9)  is  found  for  every  pair  of  clusters,  and  the  two 
clusters  with  the  smallest  distance  are  merged. 


Example 14.3.3(a). As in Example 14.3.2(a) for single linkage clustering, we
illustrate the use of the distance matrix in complete linkage clustering with the first six
observations of the city crime data in Table 14.1. The initial distance matrix is exactly
the same as in Example 14.3.2(a):


City       Distance
Atlanta        0    536.6    516.4    590.2    693.6    716.2
Boston     536.6        0    447.4    833.1    915.0    881.1
Chicago    516.4    447.4        0    924.0   1073.4    971.5
Dallas     590.2    833.1    924.0        0    527.7    464.5
Denver     693.6    915.0   1073.4    527.7        0    358.7
Detroit    716.2    881.1    971.5    464.5    358.7        0

The smallest distance is 358.7 between Denver and Detroit, and these two therefore
form the first cluster, $C_1$ = {Denver, Detroit}. Note that since the first cluster is
based on the initial distance matrix, it will be the same regardless of which
hierarchical clustering method is used.

In the next step, the distance matrix is calculated for Atlanta, Boston, Chicago,
Dallas, and $C_1$:


Atlanta        0    536.6    516.4    590.2    716.2
Boston     536.6        0    447.4    833.1    915.0
Chicago    516.4    447.4        0    924.0   1073.4
Dallas     590.2    833.1    924.0        0    527.7
C1         716.2    915.0   1073.4    527.7        0

Note that this distance matrix differs from its analog for the second step in
Example 14.3.2(a) only in the distances between $C_1$ and the other cities. All
elements of this matrix and subsequent distance matrices below are contained in the
original distance matrix for the six cities. The smallest distance is 447.4 between
Boston and Chicago. Therefore, $C_2$ = {Boston, Chicago}. At the next step, distances
are calculated for Atlanta, Dallas, $C_1$, and $C_2$:


Atlanta        0    536.6    590.2    716.2
C2         536.6        0    924.0   1073.4
Dallas     590.2    924.0        0    527.7
C1         716.2   1073.4    527.7        0

The smallest distance, 527.7, defines $C_3$ = {Dallas, $C_1$}. The distance matrix for
Atlanta, $C_2$, and $C_3$ is given by

Atlanta        0    536.6    716.2
C2         536.6        0   1073.4
C3         716.2   1073.4        0

The smallest distance is 536.6 between Atlanta and $C_2$, so that $C_4$ = {Atlanta,
$C_2$}. The distance matrix for $C_3$ and $C_4$ is

C3             0   1073.4
C4        1073.4        0

The last cluster is given by $C_5$ = {$C_3$, $C_4$}. The dendrogram in Figure 14.4 shows
the steps in this example. □
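
The reduced distance matrices in Example 14.3.3(a) can be generated by updating a table of distances rather than recomputing from the raw data. The following is a sketch under that assumption: after two clusters are merged, the distance from the new cluster to any other cluster is the larger of the two old distances, as (14.9) implies. The names `complete_linkage_steps`, `CITIES`, and `DIST` are hypothetical, and the labels C1, C2, ... are assumed not to collide with any item name.

```python
CITIES = ["Atlanta", "Boston", "Chicago", "Dallas", "Denver", "Detroit"]
DIST = [
    [0.0, 536.6, 516.4, 590.2, 693.6, 716.2],
    [536.6, 0.0, 447.4, 833.1, 915.0, 881.1],
    [516.4, 447.4, 0.0, 924.0, 1073.4, 971.5],
    [590.2, 833.1, 924.0, 0.0, 527.7, 464.5],
    [693.6, 915.0, 1073.4, 527.7, 0.0, 358.7],
    [716.2, 881.1, 971.5, 464.5, 358.7, 0.0],
]


def complete_linkage_steps(dist, labels):
    """Return (new cluster name, first part, second part, merge distance) per step."""
    names = list(labels)
    # Current distance table, keyed by the unordered pair of cluster names.
    table = {
        frozenset((names[i], names[j])): dist[i][j]
        for i in range(len(names))
        for j in range(i + 1, len(names))
    }
    steps = []
    while len(names) > 1:
        pair = min(table, key=table.get)  # closest pair in the current table
        a, b = sorted(pair)
        height = table.pop(pair)
        new = f"C{len(steps) + 1}"
        for c in names:
            if c not in (a, b):
                # Farthest neighbor rule: keep the larger of the two old distances.
                table[frozenset((c, new))] = max(
                    table.pop(frozenset((c, a))), table.pop(frozenset((c, b)))
                )
        names = [c for c in names if c not in (a, b)] + [new]
        steps.append((new, a, b, height))
    return steps


STEPS = complete_linkage_steps(DIST, CITIES)
for step in STEPS:
    print(step)
```

Because each update only selects one of two existing entries, every value in the table remains an element of the original distance matrix, which is the pattern noted in the example.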


Figure 14.4. Dendrogram for complete linkage of the first six observations in the city crime
data in Table 14.1 [see Example 14.3.3(a)]. (Horizontal axis: distance between clusters,
from 1250 at the left to 0 at the right. Cities, top to bottom: Detroit, Denver, Dallas,
Chicago, Boston, Atlanta.)
Example 14.3.3(b). To further illustrate the complete linkage method, we use the
complete  crime  data  in  Table  14.1.  The  dendrogram  in  Figure  14.5  shows  the  clusters 
found  for  this  data  set  by  the  complete  linkage  approach.  There  are  some  differences 
between  these  groupings  and  the  groupings  from  single  linkage  in  Figure  14.3.  □ 

Figure 14.5. Dendrogram for complete linkage of the complete city crime data of Table 14.1
[see Example 14.3.3(b)]. (Horizontal axis: maximum distance between clusters, from 1500 at
the left to 0 at the right. Cities, top to bottom: New York, Los Angeles, Portland, Honolulu,
Denver, Washington, Detroit, Kansas City, Houston, Dallas, Chicago, Boston, New Orleans,
Hartford, Tucson, Atlanta.)
14.3.4  Average  Linkage

In the average linkage approach, the distance between two clusters A and B is
defined as the average of the $n_A n_B$ distances between the $n_A$ points in A and the
$n_B$ points in B:

$$D(A, B) = \frac{1}{n_A n_B} \sum_{i=1}^{n_A} \sum_{j=1}^{n_B} d(\mathbf{y}_i, \mathbf{y}_j), \tag{14.10}$$

where the sum is over all $\mathbf{y}_i$ in A and all $\mathbf{y}_j$ in B. At each step, we join the two
clusters with the smallest distance, as measured by (14.10).
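
As a small illustration of (14.10), the following sketch averages all pairwise Euclidean distances between two clusters. The function name `average_linkage_distance` and the toy points are hypothetical and are not taken from the city crime data.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def average_linkage_distance(cluster_a, cluster_b):
    """Average of the n_A * n_B pairwise distances between two clusters, as in (14.10)."""
    total = sum(dist(p, q) for p in cluster_a for q in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


# Toy clusters: one point in A, two points in B.
A = [(0.0, 0.0)]
B = [(3.0, 4.0), (6.0, 8.0)]

# The two pairwise distances are 5 and 10, so their average is 7.5.
print(average_linkage_distance(A, B))
```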


Example 14.3.4. Figure 14.6 shows the dendrogram resulting from the average
linkage method applied to the city crime data in Table 14.1. The solution is the same as
the complete linkage solution for this data set given in Example 14.3.3(b) and
Figure 14.5. □


14.3.5  Centroid 

In  the  centroid  method,  the  distance  between  two  clusters  A  and  B  is  defined  as 
the  Euclidean  distance  between  the  mean  vectors  (often  called  centroids)  of  the  two 
clusters: 


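$$D(A, B) = d(\bar{\mathbf{y}}_A, \bar{\mathbf{y}}_B), \tag{14.11}$$

where $\bar{\mathbf{y}}_A$ and $\bar{\mathbf{y}}_B$ are the mean vectors for the observation vectors in A and the
observation vectors in B, respectively, and $d(\bar{\mathbf{y}}_A, \bar{\mathbf{y}}_B)$ is defined in (14.2). We define
$\bar{\mathbf{y}}_A$ and $\bar{\mathbf{y}}_B$ in the usual way, that is, $\bar{\mathbf{y}}_A = \sum_{i=1}^{n_A} \mathbf{y}_i / n_A$. The two clusters with the
smallest distance between centroids are merged at each step.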

After two clusters A and B are joined, the centroid of the new cluster AB is given
by the weighted average

$$\bar{\mathbf{y}}_{AB} = \frac{n_A \bar{\mathbf{y}}_A + n_B \bar{\mathbf{y}}_B}{n_A + n_B}. \tag{14.12}$$
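
The weighted average in (14.12) gives the same vector as recomputing the mean of the pooled observations. The sketch below checks this on toy data; the names `centroid` and `merged_centroid` and the points themselves are hypothetical.

```python
from math import dist


def centroid(points):
    """Mean vector of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def merged_centroid(n_a, mean_a, n_b, mean_b):
    """Centroid of the merged cluster AB as the weighted average in (14.12)."""
    return tuple(
        (n_a * a + n_b * b) / (n_a + n_b) for a, b in zip(mean_a, mean_b)
    )


A = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
B = [(5.0, 5.0)]

mean_a = centroid(A)
mean_b = centroid(B)
centroid_distance = dist(mean_a, mean_b)  # the distance D(A, B) in (14.11)
mean_ab = merged_centroid(len(A), mean_a, len(B), mean_b)

print(mean_a, mean_b, centroid_distance)
print(mean_ab, centroid(A + B))  # the two centroids of AB should agree
```

Updating from the two old centroids and the cluster sizes avoids another pass over the raw observations after each merge.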


Example 14.3.5. Figure 14.7 shows the dendrogram resulting from using the
centroid clustering method on the complete city crime data in Table 14.1.

Note the two crossovers in the dendrogram in Figure 14.7. Boston and Chicago
join at a distance of 447.4. Then that cluster joins with {Atlanta, Tucson, Hartford}
at a distance of 441.1. Finally, all five join with New Orleans at a distance of 393.8.
Crossovers are discussed in Section 14.3.9a. □

Figure 14.6. Dendrogram for average linkage clustering of the data in Table 14.1 (see
Example 14.3.4).

Figure 14.7. Dendrogram for the centroid clustering of the complete city crime data in Table
14.1 (see Example 14.3.5).
14.3.6  Median

If two clusters A and B are combined using the centroid method, and if A contains a
larger number of items than B, then the new centroid $\bar{\mathbf{y}}_{AB} = (n_A \bar{\mathbf{y}}_A + n_B \bar{\mathbf{y}}_B)/(n_A + n_B)$
may be much closer to $\bar{\mathbf{y}}_A$ than to $\bar{\mathbf{y}}_B$. To avoid weighting the mean vectors
according to cluster size, we can use the median (midpoint) of the line joining A and
B as the point for computing new distances to other clusters:

$$\mathbf{m}_{AB} = \tfrac{1}{2}(\bar{\mathbf{y}}_A + \bar{\mathbf{y}}_B). \tag{14.13}$$

The  two  clusters  with  the  smallest  distance  between  medians  are  merged  at  each  step. 

Note  that  the  median  in  (14.13)  is  not  the  ordinary  median  in  the  statistical  sense. 
The  terminology  arises  from  a  median  of  a  triangle,  namely,  the  line  from  a  vertex  to 
the  midpoint  of  the  opposite  side. 
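
To see how the midpoint in (14.13) differs from the weighted centroid in (14.12) when cluster sizes are unequal, consider the following sketch. The cluster sizes, mean vectors, and function names are hypothetical values chosen only for illustration.

```python
def weighted_centroid(n_a, mean_a, n_b, mean_b):
    """Centroid method: size-weighted average of the two mean vectors, as in (14.12)."""
    return tuple(
        (n_a * a + n_b * b) / (n_a + n_b) for a, b in zip(mean_a, mean_b)
    )


def midpoint(mean_a, mean_b):
    """Median method: unweighted midpoint of the two mean vectors, as in (14.13)."""
    return tuple((a + b) / 2 for a, b in zip(mean_a, mean_b))


# Cluster A has nine items and cluster B has one.
n_a, mean_a = 9, (0.0, 0.0)
n_b, mean_b = 1, (10.0, 0.0)

print(weighted_centroid(n_a, mean_a, n_b, mean_b))  # pulled toward the larger cluster A
print(midpoint(mean_a, mean_b))                     # halfway between the two means
```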


Example  14.3.6.  Figure  14.8  shows  the  dendrogram  resulting  from  using  the 
median  distance  clustering  method  on  the  complete  city  crime  data  in  Table  14.1.  In 
Figure  14.8,  we  see  the  same  two  crossovers  as  in  Figure  14.7.  □ 


14.3.7  Ward’s  Method 

Ward’s method, also called the incremental sum of squares method, uses the
within-cluster (squared) distances and the between-cluster (squared) distances (Ward 1963,
Wishart 1969a). If AB is the cluster obtained by combining clusters A and B, then
the sums of within-cluster distances (of the items from the cluster mean vectors) are

$$\mathrm{SSE}_A = \sum_{i=1}^{n_A} (\mathbf{y}_i - \bar{\mathbf{y}}_A)'(\mathbf{y}_i - \bar{\mathbf{y}}_A), \tag{14.14}$$

$$\mathrm{SSE}_B = \sum_{i=1}^{n_B} (\mathbf{y}_i - \bar{\mathbf{y}}_B)'(\mathbf{y}_i - \bar{\mathbf{y}}_B), \tag{14.15}$$

$$\mathrm{SSE}_{AB} = \sum_{i=1}^{n_{AB}} (\mathbf{y}_i - \bar{\mathbf{y}}_{AB})'(\mathbf{y}_i - \bar{\mathbf{y}}_{AB}), \tag{14.16}$$

where $\bar{\mathbf{y}}_{AB} = (n_A \bar{\mathbf{y}}_A + n_B \bar{\mathbf{y}}_B)/(n_A + n_B)$, as in (14.12), and $n_A$, $n_B$, and $n_{AB} = n_A + n_B$
are the numbers of points in A, B, and AB, respectively. Since these sums
of distances are equivalent to within-cluster sums of squares, they are denoted by
$\mathrm{SSE}_A$, $\mathrm{SSE}_B$, and $\mathrm{SSE}_{AB}$.

Ward’s  method  joins  the  two  clusters  A  and  B  that  minimize  the  increase  in  SSE, 
defined  as 


$$I_{AB} = \mathrm{SSE}_{AB} - (\mathrm{SSE}_A + \mathrm{SSE}_B). \tag{14.17}$$
Figure 14.8. Dendrogram for the median clustering method applied to the complete city crime
data  in  Table  14.1  (see  Example  14.3.6). 
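It can be shown that the increase $I_{AB}$ in (14.17) has the following two equivalent
forms:

$$I_{AB} = n_A(\bar{\mathbf{y}}_A - \bar{\mathbf{y}}_{AB})'(\bar{\mathbf{y}}_A - \bar{\mathbf{y}}_{AB}) + n_B(\bar{\mathbf{y}}_B - \bar{\mathbf{y}}_{AB})'(\bar{\mathbf{y}}_B - \bar{\mathbf{y}}_{AB}) \tag{14.18}$$

$$I_{AB} = \frac{n_A n_B}{n_A + n_B} (\bar{\mathbf{y}}_A - \bar{\mathbf{y}}_B)'(\bar{\mathbf{y}}_A - \bar{\mathbf{y}}_B). \tag{14.19}$$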


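Thus by (14.19), minimizing the increase in SSE is equivalent to minimizing the
between-cluster distances. If A consists only of $\mathbf{y}_i$ and B consists only of $\mathbf{y}_j$, then
$\mathrm{SSE}_A$ and $\mathrm{SSE}_B$ are zero, and (14.17) and (14.19) reduce to

$$I_{ij} = \mathrm{SSE}_{AB} = \tfrac{1}{2}(\mathbf{y}_i - \mathbf{y}_j)'(\mathbf{y}_i - \mathbf{y}_j) = \tfrac{1}{2} d^2(\mathbf{y}_i, \mathbf{y}_j).$$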


Ward’s method is related to the centroid method in Section 14.3.5. If the distance
$d(\bar{\mathbf{y}}_A, \bar{\mathbf{y}}_B)$ in (14.11) is squared and compared to (14.19), the only difference is the
coefficient $n_A n_B/(n_A + n_B)$ for Ward’s method. Thus the cluster sizes have an impact
on Ward’s method but not on the centroid method. Writing $n_A n_B/(n_A + n_B)$ in
(14.19) as

$$\frac{n_A n_B}{n_A + n_B} = \frac{1}{1/n_A + 1/n_B},$$

we see that as $n_A$ and $n_B$ increase, $n_A n_B/(n_A + n_B)$ increases. Writing the coefficient
as

$$\frac{n_A n_B}{n_A + n_B} = \frac{n_A}{1 + n_A/n_B},$$

we see that as $n_B$ increases with $n_A$ fixed, $n_A n_B/(n_A + n_B)$ increases. Therefore,
compared to the centroid method, Ward’s method is more likely to join smaller
clusters or clusters of equal size.
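
The equivalence of (14.17) and (14.19) can be checked numerically. The following sketch computes the increase in SSE both ways for two small clusters; the data and the function names `sse`, `ward_increase_direct`, and `ward_increase_shortcut` are hypothetical.

```python
def mean_vector(points):
    """Mean vector of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def sse(points):
    """Within-cluster sum of squared distances to the cluster mean, as in (14.14)."""
    m = mean_vector(points)
    return sum(sum((x - mx) ** 2 for x, mx in zip(p, m)) for p in points)


def ward_increase_direct(a, b):
    """Increase in SSE from merging a and b, computed from the definition (14.17)."""
    return sse(a + b) - (sse(a) + sse(b))


def ward_increase_shortcut(a, b):
    """The same increase computed from the two cluster means, as in (14.19)."""
    n_a, n_b = len(a), len(b)
    m_a, m_b = mean_vector(a), mean_vector(b)
    squared_distance = sum((x - y) ** 2 for x, y in zip(m_a, m_b))
    return n_a * n_b / (n_a + n_b) * squared_distance


A = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
B = [(5.0, 5.0), (7.0, 5.0)]

print(ward_increase_direct(A, B), ward_increase_shortcut(A, B))

# For two single items the increase is half the squared distance between them.
print(ward_increase_direct([(0.0, 0.0)], [(3.0, 4.0)]))
```

The shortcut needs only the cluster sizes and mean vectors, so the raw observations do not have to be revisited when candidate merges are compared.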


Example 14.3.7. Figure 14.9 shows the dendrogram resulting from using Ward’s
clustering method on the complete city crime data in Table 14.1. The vertical axis is
$I_{AB} / \sum_{i=1}^{n} (\mathbf{y}_i - \bar{\mathbf{y}})'(\mathbf{y}_i - \bar{\mathbf{y}})$, where $\bar{\mathbf{y}}$ is the overall mean vector for the data. □


Figure 14.9. Dendrogram for Ward’s method applied to the complete city crime data in Table
14.1 (see Example 14.3.7). (Axis labels: City; Increase in SSE.)


14.3.8  Flexible  Beta  Method

Suppose clusters A and B have just been merged to form cluster AB. A general
formula for the distance between AB and any other cluster C was given by Lance
and Williams (1967):

$$D(C, AB) = \alpha_A D(C, A) + \alpha_B D(C, B) + \beta D(A, B) + \gamma |D(C, A) - D(C, B)|. \tag{14.20}$$

The distances D(C, A), D(C, B), and D(A, B) are from the distance matrix before
joining A and B. The distances from AB to other clusters as given by (14.20) would
be used (along with distances between other pairs of clusters) to form the next
distance matrix for choosing the pair of clusters with smallest distance. This pair would
then be joined at the next step.

To simplify (14.20), Lance and Williams (1967) suggested the following
constraints on the parameter values:

$$\alpha_A + \alpha_B + \beta = 1, \qquad \alpha_A = \alpha_B, \qquad \gamma = 0, \qquad \beta < 1.$$

With $\alpha_A = \alpha_B$ and $\gamma = 0$, we have $2\alpha_A = 1 - \beta$ or $\alpha_A = \alpha_B = (1 - \beta)/2$, and
we need only choose a value of $\beta$. The resulting hierarchical clustering procedure is
called the flexible beta method.

The choice of $\beta$ determines the characteristics of the flexible beta clustering
procedure. Lance and Williams (1967) suggested the use of a small negative value of $\beta$,
such as $\beta = -.25$. If there are (or might be) outliers in the data, the use of a smaller
value of $\beta$, such as $\beta = -.5$, may be more likely to isolate these outliers into simple
clusters.

The distances defined for the agglomerative hierarchical methods in Sections
14.3.2-14.3.7 can all be expressed as special cases of (14.20). The requisite
parameter values are given in Table 14.2. For the centroid, median, and Ward’s methods, the
distances in (14.20) must be squared distances (assuming Euclidean distances). For
the other methods in Table 14.2, the distances may be either squared or unsquared.


Table 14.2. Parameter Values for (14.20)

Cluster Method      α_A                          α_B                          β                          γ
Single linkage      1/2                          1/2                          0                          -1/2
Complete linkage    1/2                          1/2                          0                          1/2
Average linkage     n_A/(n_A + n_B)              n_B/(n_A + n_B)              0                          0
Centroid            n_A/(n_A + n_B)              n_B/(n_A + n_B)              -n_A n_B/(n_A + n_B)^2     0
Median              1/2                          1/2                          -1/4                       0
Ward’s method       (n_A + n_C)/(n_A+n_B+n_C)    (n_B + n_C)/(n_A+n_B+n_C)    -n_C/(n_A+n_B+n_C)         0
Flexible beta       (1 - β)/2                    (1 - β)/2                    β (< 1)                    0

We illustrate the choice of parameter values in Table 14.2 for the single linkage
method. Using $\alpha_A = \alpha_B = \tfrac{1}{2}$, $\beta = 0$, and $\gamma = -\tfrac{1}{2}$ as in the first row of Table 14.2,
(14.20) becomes

$$D(C, AB) = \tfrac{1}{2} D(C, A) + \tfrac{1}{2} D(C, B) - \tfrac{1}{2} |D(C, A) - D(C, B)|. \tag{14.21}$$

If D(C, A) > D(C, B), then |D(C, A) - D(C, B)| = D(C, A) - D(C, B), and
(14.21) reduces to

$$D(C, AB) = D(C, B). \tag{14.22}$$

On the other hand, if D(C, A) < D(C, B), then |D(C, A) - D(C, B)| = D(C, B) -
D(C, A), and (14.21) reduces to

$$D(C, AB) = D(C, A). \tag{14.23}$$

Thus, (14.21) can be written as

$$D(C, AB) = \min[D(C, A), D(C, B)], \tag{14.24}$$

which is equivalent to (14.8), the definition of distance for the single linkage method.
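
The general update (14.20) and the parameter values of Table 14.2 translate directly into code. The sketch below is illustrative only; `lance_williams` and `table_14_2` are hypothetical names, and the three distances are taken from the six-city matrix of Example 14.3.2(a) with A = Denver, B = Detroit, and C = Atlanta.

```python
def lance_williams(d_ca, d_cb, d_ab, alpha_a, alpha_b, beta, gamma):
    """Distance from cluster C to the merged cluster AB, following (14.20)."""
    return alpha_a * d_ca + alpha_b * d_cb + beta * d_ab + gamma * abs(d_ca - d_cb)


def table_14_2(method, n_a=1, n_b=1, n_c=1, beta=-0.25):
    """Parameter values (alpha_A, alpha_B, beta, gamma) for each method in Table 14.2."""
    n_ab = n_a + n_b
    n_abc = n_ab + n_c
    params = {
        "single": (0.5, 0.5, 0.0, -0.5),
        "complete": (0.5, 0.5, 0.0, 0.5),
        "average": (n_a / n_ab, n_b / n_ab, 0.0, 0.0),
        "centroid": (n_a / n_ab, n_b / n_ab, -n_a * n_b / n_ab ** 2, 0.0),
        "median": (0.5, 0.5, -0.25, 0.0),
        "ward": ((n_a + n_c) / n_abc, (n_b + n_c) / n_abc, -n_c / n_abc, 0.0),
        "flexible": ((1 - beta) / 2, (1 - beta) / 2, beta, 0.0),
    }
    return params[method]


# Atlanta-Denver, Atlanta-Detroit, and Denver-Detroit distances.
d_ca, d_cb, d_ab = 693.6, 716.2, 358.7

single = lance_williams(d_ca, d_cb, d_ab, *table_14_2("single"))
complete = lance_williams(d_ca, d_cb, d_ab, *table_14_2("complete"))
average = lance_williams(d_ca, d_cb, d_ab, *table_14_2("average"))

print(single, min(d_ca, d_cb))       # single linkage reproduces the minimum
print(complete, max(d_ca, d_cb))     # complete linkage reproduces the maximum
print(average, (d_ca + d_cb) / 2)    # average linkage with n_A = n_B = 1
```

Keeping the parameters in one lookup makes it easy to switch methods without changing the update itself; for the centroid, median, and Ward entries the inputs would need to be squared distances, as noted in the text.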

Example 14.3.8. Figures 14.10 and 14.11 show dendrograms produced when using
the flexible beta clustering method on the complete city crime data in Table 14.1,
with $\beta = -.25$ and $\beta = -.75$. The two results are similar. □

14.3.9  Properties  of  Hierarchical  Methods 
14.3.9a  Monotonicity 

If  an  item  or  a  cluster  joins  another  cluster  at  a  distance  that  is  less  than  the  distance 
for  the  previous  merger  of  two  clusters,  we  say  that  an  inversion  or  a  reversal  has 
occurred.  The  reversal  is  represented  by  a  crossover  in  the  dendrogram.  Examples  of 
crossovers  can  be  found  in  Figures  14.7  and  14.8. 

A  hierarchical  method  in  which  reversals  cannot  occur  is  said  to  be  monotonic, 
because  the  distance  at  each  step  is  greater  than  the  distance  at  the  previous  step.  A 
distance  measure  or  clustering  method  that  is  monotonic  is  also  called  ultrametric. 

We now show that the single linkage and complete linkage methods are
monotonic. Let $d_k$ be the distance at which two clusters are joined at the kth step. We can
describe steps k and k + 1 in terms of four clusters A, B, C, and D. Suppose D(A, B)
is less than the distance between any other pair among these four clusters, so that A
and B are joined at step k to form AB. Then


$$d_k = D(A, B) < \min\{D(A, C), D(B, C), D(C, D)\}. \tag{14.25}$$

[If D(A, B) is less than these three distances, it is less than the other two possible
distances, D(A, D) and D(B, D).] Suppose at step k + 1 we join AB and C or we join
C and D. If we merge C and D, then by (14.25), $d_k = D(A, B) < D(C, D) = d_{k+1}$.
If we join AB and C, then for single linkage (14.24) gives

$$d_{k+1} = D(C, AB) = \min\{D(A, C), D(B, C)\} > d_k = D(A, B).$$

The inequality follows because, by (14.25), both D(A, C) and D(B, C) exceed D(A, B).
The same reasoning holds for complete linkage, in which the minimum is replaced by the
maximum. Thus, the single linkage and complete linkage methods are monotonic.

Figure 14.10. Dendrogram for the flexible beta method with $\beta = -.25$ applied to the
complete city crime data in Table 14.1 (see Example 14.3.8).

Figure 14.11. Dendrogram for the flexible beta method with $\beta = -.75$ applied to the
complete city crime data in Table 14.1 (see Example 14.3.8). (Axis label: flexible beta
distance.)

For the methods in Table 14.2 other than single linkage and complete linkage, we
have $\gamma = 0$; then by (14.20) and (14.25),

$$D(C, AB) > (\alpha_A + \alpha_B + \beta) D(A, B). \tag{14.26}$$

Thus we need $\alpha_A + \alpha_B + \beta \ge 1$ for monotonicity. Using this criterion, we see that
all methods in Table 14.2 (beyond the first two) are monotonic except the centroid
and median methods. (These two methods showed crossovers in the dendrograms
in Figures 14.7 and 14.8.) Because of lack of monotonicity, some authors do not
recommend the centroid and median methods.
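
A reversal under the centroid method is easy to produce with three nearly equidistant points. The sketch below uses hypothetical points and names; it works with squared Euclidean distances, for which the centroid parameters with $n_A = n_B = 1$ are 1/2, 1/2, and -1/4, so the coefficient sum is 3/4.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


# Three points forming a nearly equilateral triangle.
a, b, c = (0.0, 0.0), (2.0, 0.0), (1.0, 1.8)

# Step 1: a and b are the closest pair (squared distance 4 versus about 4.24).
first_merge = squared_distance(a, b)
assert first_merge < squared_distance(a, c)
assert first_merge < squared_distance(b, c)

# Step 2: distance from c to the centroid of {a, b}.
centroid_ab = tuple((x + y) / 2 for x, y in zip(a, b))
second_merge = squared_distance(centroid_ab, c)

# The same value from the update formula (14.20) with the centroid parameters.
via_update = (
    0.5 * squared_distance(a, c)
    + 0.5 * squared_distance(b, c)
    - 0.25 * squared_distance(a, b)
)

coefficient_sum = 0.5 + 0.5 - 0.25
reversal = second_merge < first_merge

print(first_merge, second_merge, via_update, coefficient_sum, reversal)
```

The second merge occurs at a smaller distance than the first, which is the kind of crossover a dendrogram would show for a method that is not monotonic.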

14.3.9b  Contraction  or  Dilation 

We now consider the characteristics of the distances or proximities between the
original points. As clusters form, the properties of this space of distances may be altered
somewhat. A clustering method that does not alter the spatial properties is referred
to by Lance and Williams (1967) as space-conserving. A method that is not
space-conserving may either contract or dilate the space.

A  method  is  space-contracting  if  newly  formed  clusters  appear  to  move  closer  to 
individual  observations,  so  that  an  individual  item  tends  to  join  an  existing  cluster 
rather  than  join  with  another  individual  item  to  form  a  new  cluster.  This  tendency  is 
also  called  chaining. 

A  method  is  space-dilating  if  newly  formed  clusters  appear  to  move  away  from 
individual  observations,  so  that  individual  items  tend  to  form  new  clusters  rather  than 
join  existing  clusters.  In  this  case,  clusters  appear  to  be  more  distinct  than  they  are. 

Dubien  and  Warde  (1979)  described  the  spatial  properties  as  follows.  Suppose 
that  the  distances  among  three  clusters  satisfy 

D(A, B) < D(A, C) < D(B, C).

Then  a  cluster  method  is  space-conserving  if 

D(A,C)  <  D(AB,C)  <  D(B,C).  (14.27) 

A  method  is  space-contracting  if  the  first  inequality  in  (14.27)  does  not  hold  and 
space-dilating  if  the  second  inequality  does  not  hold. 
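
The criterion in (14.27) can be written as a small classification function. The sketch below is illustrative; the function name `classify_spatial_property` and the distances 1, 2, and 3 are hypothetical.

```python
def classify_spatial_property(d_ab, d_ac, d_bc, d_ab_c):
    """Classify an updated distance D(AB, C) using the inequalities in (14.27).

    Assumes the ordering D(A, B) < D(A, C) < D(B, C) stated in the text.
    """
    if not d_ab < d_ac < d_bc:
        raise ValueError("expected D(A, B) < D(A, C) < D(B, C)")
    if not d_ac < d_ab_c:
        return "space-contracting"  # first inequality in (14.27) fails
    if not d_ab_c < d_bc:
        return "space-dilating"     # second inequality in (14.27) fails
    return "space-conserving"


d_ab, d_ac, d_bc = 1.0, 2.0, 3.0

# Updated distances for three linkage rules, with A and B single items.
single = min(d_ac, d_bc)
complete = max(d_ac, d_bc)
average = (d_ac + d_bc) / 2

print(classify_spatial_property(d_ab, d_ac, d_bc, single))
print(classify_spatial_property(d_ab, d_ac, d_bc, complete))
print(classify_spatial_property(d_ab, d_ac, d_bc, average))
```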
The single linkage method is very space-contracting, with marked chaining
tendencies. For this reason, single linkage is not recommended by some authors.
Complete linkage, on the other hand, is very space-dilating, with a tendency to artificially
impose cluster boundaries.

Other hierarchical methods fall in between the extremes represented by
single linkage and complete linkage. The centroid and average linkage methods
are largely space-conserving, whereas Ward’s method is space-contracting.
Whenever a method produces reversals for a particular data set, it can be considered to
be space-contracting. Thus, for example, the centroid method is space-conserving
unless it has reversals, whereupon it becomes space-contracting.

The flexible beta method is space-contracting for $\beta > 0$, space-conserving for
$\beta = 0$, and space-dilating for $\beta < 0$. A small degree of dilation may help define
cluster boundaries, but too much dilation may lead to too many clusters in the early
stages. Thus the recommended value of $\beta = -.25$ may represent a good
compromise.


Example  14.3.9b.  To  illustrate  chaining  in  the  single  linkage  method,  consider  the 
data  plotted  in  Figure  14.12  (similar  to  Everitt  1993,  p.  68).  There  are  two  distinct 
clusters,  A  and  C,  with  intervening  points  labeled  B  that  do  not  belong  to  A  or  C. 

In  Figure  14.13,  the  two-cluster  solution  for  single  linkage  clustering  places  C i 
and  C ii  into  one  cluster  and  all  other  points  into  another  cluster.  The  three-cluster 
solution has two clusters with C’s and a cluster with A’s and B’s.


Figure 14.12. Two distinct clusters with intervening individuals.

Figure 14.13. Single linkage clustering of the data in Figure 14.12.


A dendrogram for average linkage clustering of the data in Figure 14.12 is given in
Figure 14.14. For this data set, the average linkage method is more robust to chaining.
The two-cluster solution separates the C’s from the A’s and B’s. The three-cluster
solution completely separates the three groups, A, B, and C. □

Figure 14.14. Average linkage clustering of the data in Figure 14.12.
14.3.9c  Other  Properties

The single linkage method has been criticized by many authors because of its
chaining tendencies and its sensitivity to errors in distances between observations. On the
other hand, the single linkage approach is better than the other methods at
identifying clusters that have curvy shapes instead of spherical or elliptical shapes, and it is
somewhat robust to outliers in the data.

Ward’s  method  and  the  average  linkage  method  are  also  relatively  insensitive  to 
outliers.  For  example,  in  the  average  linkage  method,  outliers  tend  to  remain  isolated 
in  the  early  stages  and  to  join  with  other  outliers  rather  than  to  join  with  large  clusters 
or  with  less  compact  clusters.  This  is  due  to  two  properties  of  the  average  linkage 
method:  (1)  the  average  distance  between  two  groups  (squared  Euclidean  distance) 
increases  as  the  points  in  the  groups  are  more  spread  out,  and  (2)  the  average  distance 
increases  as  the  size  of  the  groups  increases. 

These two properties of the average linkage method are illustrated in one
dimension in Figure 14.15 (similar to Jobson 1992, pp. 524-525), where cluster A has one
point at $z_1$ and cluster B has two points, $b_1$ and $b_2$, located at $z_2 - h$ and $z_2 + h$. The
average squared distance between A and B is

$$d^2 = \tfrac{1}{2}[(z_1 - z_2 + h)^2 + (z_1 - z_2 - h)^2]$$

$$\phantom{d^2} = \tfrac{1}{2}[(z_1 - z_2)^2 + h^2 + 2h(z_1 - z_2) + (z_1 - z_2)^2 + h^2 - 2h(z_1 - z_2)]$$

$$\phantom{d^2} = (z_1 - z_2)^2 + h^2.$$

Thus the average distance between A and B increases as the spread of $b_1$ and $b_2$
increases (that is, as h increases).

To illustrate the second property of the average linkage method, suppose cluster
B in Figure 14.15 consists of a single point located at $z_2$. Then, the distance between
A and B is $(z_1 - z_2)^2$, and A is closer to B than it is if B consists of two points.
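
Both properties can be confirmed numerically in one dimension. In the sketch below the positions and the function name `average_squared_distance` are hypothetical values chosen for illustration.

```python
def average_squared_distance(z1, cluster_b):
    """Average squared distance from the single point z1 to the points of cluster B."""
    return sum((z1 - b) ** 2 for b in cluster_b) / len(cluster_b)


z1, z2 = 0.0, 5.0

# Property 1: the average squared distance grows with the spread h of cluster B.
for h in (1.0, 2.0, 3.0):
    two_points = [z2 - h, z2 + h]
    print(h, average_squared_distance(z1, two_points), (z1 - z2) ** 2 + h ** 2)

# Property 2: a single point at z2 is closer to A than a two-point cluster centered at z2.
single_point = average_squared_distance(z1, [z2])
two_point = average_squared_distance(z1, [z2 - 1.0, z2 + 1.0])
print(single_point, two_point)
```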


Figure  14.15.  Clusters  in  a  single  dimension. 
The centroid method is fairly robust to outliers. Complete linkage is somewhat
sensitive  to  outliers  and  tends  to  produce  clusters  of  the  same  size  and  shape.  Ward’s 
method  tends  to  yield  spherical  clusters  of  the  same  size. 

Many  studies  conclude  that  the  best  overall  performers  are  Ward’s  method  and  the 
average  linkage  method.  However,  there  seems  to  be  an  interaction  between  meth¬ 
ods  and  data  sets;  that  is,  some  methods  work  better  for  certain  data  sets,  and  other 
methods  work  better  for  other  data  sets. 

A  good  strategy  is  to  try  several  methods.  If  the  results  agree  to  some  extent,  you 
may  have  found  some  natural  clusters  in  the  data. 

14.3.10  Divisive  Methods 

In  the  agglomerative  hierarchical  methods  covered  in  Sections  14.3.2-14.3.9,  we 
begin  with  n  items  and  end  with  a  single  cluster  containing  all  n  items.  As  noted  in 
the  second  paragraph  of  Section  14.3.1,  a  divisive  hierarchical  method  starts  with  a 
single  cluster  of  n  items  and  divides  it  into  two  groups.  At  each  step  thereafter,  one  of 
the  groups  is  divided  into  two  subgroups.  The  ultimate  result  of  a  divisive  algorithm 
is  n  clusters  of  one  item  each.  The  results  can  be  shown  in  a  dendrogram. 

Divisive  methods  suffer  from  the  same  potential  drawback  as  the  agglomerative 
methods — namely,  once  a  partition  is  made,  an  item  cannot  be  moved  into  another 
group  it  does  not  belong  to  at  the  time  of  the  partitioning.  However,  if  larger  clus¬ 
ters  are  of  interest,  then  the  divisive  approach  may  sometimes  be  preferred  over  the 
agglomerative  approach,  in  which  the  larger  clusters  are  reached  only  after  a  large 
number  of  joinings  of  smaller  groups. 

Divisive algorithms are generally of two classes: monothetic and polythetic. In a monothetic approach, the division of a group into two subgroups is based on a single variable, whereas the polythetic approach uses all p variables to make the split.

If  the  variables  are  binary  (quantitative  variables  can  be  converted  to  binary  vari¬ 
ables),  the  monothetic  approach  can  easily  be  applied.  Division  into  two  groups  is 
based  on  presence  or  absence  of  an  attribute.  The  variable  (attribute)  is  chosen  that 
maximizes  a  chi-square  statistic  or  an  information  statistic;  see  Everitt  (1993,  pp.  87- 
88)  or  Gordon  (1999,  pp.  130-134). 

For a monothetic approach using a quantitative variable y, we seek to maximize the between-group sum of squares,

$$
\mathrm{SSB} = n_1(\bar{y}_1 - \bar{y})^2 + n_2(\bar{y}_2 - \bar{y})^2,
$$

where $n_1$ and $n_2$ are the two group sizes (with $n_1 + n_2 = n$), $\bar{y}_1$ and $\bar{y}_2$ are the group means, and $\bar{y}$ is the overall mean based on all n observations. The sum of squares SSB would be calculated for all possible splits into two groups of sizes $n_1$ and $n_2$ and for each of the p variables. The final division would be based on the variable that maximizes $\mathrm{SSB}/\sum_{i=1}^{n}(y_i - \bar{y})^2$.
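
The search over variables can be sketched as follows. This is an illustration under stated assumptions, not the procedure of any particular package: the function name `best_monothetic_split` is hypothetical, and the sketch considers only splits made at a threshold on the sorted values of each variable rather than every possible subset.

```python
def best_monothetic_split(data):
    """Return (ratio, variable_index, threshold) maximizing SSB / total SS.

    data is a list of rows, each row a list of p numeric values.
    Only threshold splits on the sorted values are examined (an assumption
    of this sketch).
    """
    n = len(data)
    p = len(data[0])
    best = None
    for v in range(p):
        values = sorted(row[v] for row in data)
        mean = sum(values) / n
        total_ss = sum((y - mean) ** 2 for y in values)
        if total_ss == 0:
            continue  # a constant variable cannot separate the items
        for n1 in range(1, n):
            if values[n1 - 1] == values[n1]:
                continue  # no threshold fits between tied values
            n2 = n - n1
            mean1 = sum(values[:n1]) / n1
            mean2 = sum(values[n1:]) / n2
            ssb = n1 * (mean1 - mean) ** 2 + n2 * (mean2 - mean) ** 2
            ratio = ssb / total_ss
            threshold = (values[n1 - 1] + values[n1]) / 2
            if best is None or ratio > best[0]:
                best = (ratio, v, threshold)
    return best


rows = [[1, 5], [2, 1], [10, 4], [11, 2]]
ratio, variable, threshold = best_monothetic_split(rows)
print(variable, threshold, round(ratio, 4))
```

Dividing SSB by the total sum of squares puts the p variables on a common scale, so the comparison across variables does not depend on their units.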

For a polythetic approach, we consider a technique proposed by MacNaughton-Smith et al. (1964). To divide a group, we work with a splinter group and the remainder. We seek the item in the remainder whose average distance (dissimilarity) from other items in the remainder, minus its average distance from items in the splinter group, is largest. If the largest difference is positive, the item is shifted to the splinter group. If the largest difference is negative, the procedure stops, and the division is complete. We can start the splinter group with the item that has the largest average distance from the other items in the group.
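
The splinter-group steps can be sketched as below, assuming the dissimilarities are supplied as a symmetric matrix (a list of lists). The function name `splinter_split` is hypothetical, and treating a difference of exactly zero as a stopping condition is a choice of this sketch, since the description covers only the positive and negative cases.

```python
def splinter_split(dist):
    """Divide items 0..n-1 into (splinter, remainder) from a distance matrix."""
    n = len(dist)

    def average_distance(i, group):
        # Average distance from item i to the members of group other than i.
        others = [j for j in group if j != i]
        return sum(dist[i][j] for j in others) / len(others)

    remainder = list(range(n))
    # Start the splinter group with the item farthest, on average, from the rest.
    first = max(remainder, key=lambda i: average_distance(i, remainder))
    splinter = [first]
    remainder.remove(first)

    while len(remainder) > 1:
        diffs = {
            i: average_distance(i, remainder) - average_distance(i, splinter)
            for i in remainder
        }
        candidate = max(diffs, key=diffs.get)
        if diffs[candidate] <= 0:
            break  # no item is closer, on average, to the splinter group
        splinter.append(candidate)
        remainder.remove(candidate)
    return splinter, remainder


points = [0, 1, 2, 10, 11]
dist = [[abs(a - b) for b in points] for a in points]
print(splinter_split(dist))
```

Applying the same function to the items of either resulting group continues the division one level further down the hierarchy.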

Example 14.3.10. In Table 14.3 we have the track records of eight countries (Dawkins 1989). Based on the distance matrix for these eight observations, the average distance from each observation to the other seven observations is given in Table 14.4. Since USA has the greatest average distance to the other countries, USA becomes the first observation in the splinter group. Next, the average distance from each observation in the remainder to the other six observations in the remainder is calculated. Then the (average) distance between USA and each item in the remainder is calculated. (This may be read directly from the distance matrix, since there is only one observation in the splinter group.) Finally, the difference between the average distance to the remainder and the average distance to the splinter group is calculated. The results are in Table 14.5. Because Australia has the only positive difference in Table 14.5, it is added to the splinter group with USA. This process is repeated for the six countries in the remainder; the results are given in Table 14.6. Since no difference in Table 14.6 is positive, the process stops, giving the following clusters:


Table 14.3. Athletic Records for Eight Countries

Country        1      2      3      4     5      6      7       8
Australia    10.31  20.06  44.84  1.74  3.57  13.28  27.66  128.30
Belgium      10.34  20.68         1.73  3.60  13.22  27.45  129.95
Canada       10.17  20.22  45.68  1.76  3.63  13.55  28.09  130.15
GDR          10.12  20.33  44.87  1.73  3.56  13.17  27.42  129.92
GB           10.11  20.21  44.93  1.70  3.51  13.01  27.51  129.13
Kenya        10.46  20.66  44.92  1.73  3.55  13.10  27.80  129.75
USA           9.93  19.75  43.86  1.73  3.53  13.20  27.43  128.22
USSR         10.07  20.00  44.60  1.75  3.59  13.20  27.53  130.55

Event:  (1)  100  m  (s),  (2)  200  m  (s),  (3)  400  m  (s),  (4)  800  m  (min),  (5)  1500  m  (min),  (6)  5000  m  (min), 
(7)  10000  m  (min),  (8)  Marathon  (min). 


Table 14.4. Average Distance from Each Country to the Other Seven

Country    Average Distance    Country    Average Distance
USA             2.068          USSR            1.513
Aust            1.643          Canada          1.594
GB              1.164          Kenya           1.156
GDR             1.083          Belgium         1.160


Table 14.5. Average Distances to Remainder and Splinter Group for Seven Countries

             Average Distance    Average Distance         Difference
Country      to Remainder (1)    to Splinter Group (2)    (1) - (2)
Australia         1.729               1.126                  .603
GB                1.108               1.504                 -.396
GDR                .918               2.070                -1.151
USSR              1.355               2.464                -1.111
Canada            1.392               2.808                -1.416
Kenya              .986               2.173                -1.186
Belgium            .975               2.329                -1.353

Table 14.6. Average Distances to Remainder and Splinter Group for Six Countries

             Average Distance    Average Distance         Difference
Country      to Remainder (1)    to Splinter Group (2)    (1) - (2)
GB                1.144               1.216                 -.072
GDR                .767               1.872                -1.105
USSR              1.169               2.373                -1.203
Canada            1.249               2.457                -1.208
Kenya              .865               1.884                -1.019
Belgium            .813               2.058                -1.245

$C_1$ = {USA, Australia}, $C_2$ = {GB, GDR, USSR, Canada, Kenya, Belgium}. We could continue and divide $C_2$ into two groups in the same way. □


14.4  NONHIERARCHICAL  METHODS 

In  this  section,  we  discuss  three  nonhierarchical  techniques:  partitioning,  mixtures 
of  distributions,  and  density  estimation.  Among  these  three  methods,  partitioning  is 
the  most  commonly  used. 

14.4.1  Partitioning 

In  the  partitioning  approach,  the  observations  are  separated  into  g  clusters  without 
using  a  hierarchical  approach  based  on  a  matrix  of  distances  or  similarities  between 
all  pairs  of  points.  The  methods  described  in  this  section  are  sometimes  called  opti¬ 
mization  methods  rather  than  partitioning. 

An  attractive  strategy  would  be  to  examine  all  possible  ways  to  partition  n  items 
into  g  clusters  and  find  the  optimal  clustering  according  to  some  criterion.  However, 
the  number  of  possible  partitions  as  given  by  (14.7)  is  prohibitively  large  for  even 
moderate  values  of  n  and  g.  Thus  we  seek  simpler  techniques. 
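
To get a feel for how quickly the number of partitions grows, the count can be computed directly. Equation (14.7) is not reproduced in this section; the sketch below assumes it is the standard count of ways to divide n items into g nonempty groups (the Stirling number of the second kind). The function name `num_partitions` is hypothetical.

```python
from math import comb, factorial


def num_partitions(n, g):
    """Number of ways to partition n items into g nonempty clusters.

    Uses the inclusion-exclusion formula
    (1/g!) * sum_{k=1}^{g} (-1)^(g-k) * C(g, k) * k^n,
    assumed here to correspond to (14.7).
    """
    total = sum((-1) ** (g - k) * comb(g, k) * k ** n for k in range(1, g + 1))
    return total // factorial(g)


for n, g in [(4, 2), (10, 3), (25, 5)]:
    print(n, g, num_partitions(n, g))
```

Even for the 25 countries and five clusters used later in this section, the count is far too large for exhaustive search to be practical.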

14.4.1a k-Means

We  now  consider  an  approach  to  partitioning  that  is  usually  called  the  k-means 
method.  (We  will  continue  to  use  the  notation  g  rather  than  k  for  the  number  of 
clusters.)  The  method  allows  the  items  to  be  moved  from  one  cluster  to  another,  a 
reallocation  that  is  not  available  in  the  hierarchical  methods. 

We  first  select  g  items  to  serve  as  seeds.  These  are  later  replaced  by  the  centroids 
(mean  vectors)  of  the  clusters.  There  are  various  ways  we  can  choose  the  seeds:  select 
g  items  at  random  (perhaps  separated  by  a  specified  minimum  distance),  choose  the 
first  g  points  in  the  data  set  (again  subject  to  a  minimum  distance  requirement),  select 
the  g  points  that  are  mutually  farthest  apart,  find  the  g  points  of  maximum  density, 
or  specify  g  regularly  spaced  points  in  a  gridlike  pattern  (these  would  not  be  actual 
data  points). 

For  these  methods  of  selecting  seeds,  the  number  of  clusters,  g,  must  be  speci¬ 
fied.  Alternatively,  a  minimum  distance  between  seeds  may  be  specified,  and  then  all 
items  that  satisfy  this  criterion  are  chosen  as  seeds. 

After  the  seeds  are  chosen,  each  remaining  point  in  the  data  set  is  assigned  to  the 
cluster  with  the  nearest  seed  (based  on  Euclidean  distance).  As  soon  as  a  cluster  has 
more  than  one  member,  the  cluster  seed  is  replaced  by  the  centroid. 

After  all  items  are  assigned  to  clusters,  each  item  is  examined  to  see  if  it  is  closer 
to  the  centroid  of  another  cluster  than  to  the  centroid  of  its  own  cluster.  If  so,  the  item 
is  moved  to  the  new  cluster  and  the  two  cluster  centroids  are  updated.  This  process 
is  continued  until  no  further  improvement  is  possible. 
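
The assignment and reallocation steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm of any particular software package: the seeds are given as row indices of the data, a cluster with a single member is never emptied, and the function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(rows):
    """Mean vector of a list of points."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def k_means(data, seed_idx, max_passes=100):
    """Sequential k-means: returns (labels, centers)."""
    g = len(seed_idx)
    members = [[i] for i in seed_idx]
    centers = [list(data[i]) for i in seed_idx]
    labels = [None] * len(data)
    for c, i in enumerate(seed_idx):
        labels[i] = c

    # Initial assignment: each remaining point joins the nearest cluster,
    # and that cluster's seed is replaced by its centroid.
    for i, row in enumerate(data):
        if labels[i] is not None:
            continue
        c = min(range(g), key=lambda k: sq_dist(row, centers[k]))
        labels[i] = c
        members[c].append(i)
        centers[c] = centroid([data[j] for j in members[c]])

    # Reallocation: move an item if another centroid is strictly closer,
    # updating both centroids, until a full pass makes no move.
    for _ in range(max_passes):
        moved = False
        for i, row in enumerate(data):
            old = labels[i]
            if len(members[old]) == 1:
                continue  # assumption of this sketch: keep every cluster nonempty
            new = min(range(g), key=lambda k: sq_dist(row, centers[k]))
            if new != old and sq_dist(row, centers[new]) < sq_dist(row, centers[old]):
                members[old].remove(i)
                members[new].append(i)
                labels[i] = new
                centers[old] = centroid([data[j] for j in members[old]])
                centers[new] = centroid([data[j] for j in members[new]])
                moved = True
        if not moved:
            break
    return labels, centers


data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
print(k_means(data, seed_idx=[0, 3])[0])
print(k_means(data, seed_idx=[0, 1])[0])
```

In this small example the poorly chosen seeds (the first two points, both from the same group) still recover the same two clusters after one reallocation pass. With less clearly separated data, different seeds can lead to different final clusters.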

The  k-means  procedure  is  somewhat  sensitive  to  the  initial  choice  of  seeds.  It 
might  be  advisable  to  try  the  procedure  again  with  another  choice  of  seeds.  If  differ¬ 
ent  initial  choices  of  seeds  produce  widely  different  final  clusters,  or  if  convergence 
is  extremely  slow,  there  may  be  no  natural  clusters  in  the  data. 

The k-means partitioning method can also be used as a possible improvement on hierarchical techniques. We first cluster the items using a hierarchical method and then use the centroids of these clusters as seeds for a k-means approach, which will allow points to be reallocated from one cluster to another.

Example  14.4.1a.  Protein  consumption  in  25  European  countries  for  nine  food 
groups  is  given  in  Table  14.7  (Hand  et  al.  1994,  p.  298).  In  order  to  illustrate  the 
sensitivity  of  the  k-means  clustering  method  to  the  initial  choice  of  seeds,  we  use  the 
following  four  methods  of  choosing  seeds: 

1.  Select  at  random  g  observations  that  are  at  least  a  distance  r  apart. 

2.  Select  the  first  g  observations  that  are  at  least  a  distance  r  apart. 

3.  Select  the  g  observations  that  are  mutually  farthest  apart. 

4.  Use  the  g  centroids  from  the  g-cluster  solution  from  the  average  linkage  (hier¬ 
archical)  clustering  method. 


To help choose g, the number of clusters, we plot the first two principal components in Figure 14.16. It appears that there may be at least five clusters. For the first method, we select five observations at random that are at least a distance r = 1 from each other. The five chosen seeds are Ireland, UK, Poland, Greece, and East Germany. Using these seeds, the k-means method produced the clusters identified in Table 14.8 along with the distance of each observation from its cluster centroid.

Table 14.7. Protein Data

              Red    White                                  Starchy
Country       Meat   Meat    Eggs   Milk   Fish   Cereals   Foods     Nuts   Fruits/Veg.
Albania       10.1    1.4     .5     8.9     .2     42.3       .6      5.5      1.7
Austria        8.9   14.0    4.3    19.9    2.1     28.0      3.6      1.3      4.3
Belgium       13.5    9.3    4.1    17.5    4.5     26.6      5.7      2.1      4.0
Bulgaria       7.8    6.0    1.6     8.3    1.2     56.7      1.1      3.7      4.2
Czech.         9.7   11.4    2.8    12.5    2.0     34.3      5.0      1.1      4.0
Denmark       10.6   10.8    3.7    25.0    9.9     21.9      4.8       .7      2.4
E. Germany     8.4   11.6    3.7    11.1    5.4     24.6      6.5       .8      3.6
Finland        9.5    4.9    2.7    33.7    5.8     26.3      5.1      1.0      1.4
France        18.0    9.9    3.3    19.5    5.7     28.1      4.8      2.4      6.5
Greece        10.2    3.0    2.8    17.6    5.9     41.7      2.2      7.8      6.5
Hungary        5.3   12.4    2.9     9.7     .3     40.1      4.0      5.4      4.2
Ireland       13.9   10.0    4.7    25.8    2.2     24.0      6.2      1.6      2.9
Italy          9.0    5.1    2.9    13.7    3.4     36.8      2.1      4.3      6.7
Netherlands    9.5   13.6    3.6    23.4    2.5     22.4      4.2      1.8      3.7
Norway         9.4    4.7    2.7    23.3    9.7     23.0      4.6      1.6      2.7
Poland         6.9   10.2    2.7    19.3    3.0     36.1      5.9      2.0      6.6
Portugal       6.2    3.7    1.1     4.9   14.2     27.0      5.9      4.7      7.9
Romania        6.2    6.3    1.5    11.1    1.0     49.6      3.1      5.3      2.8
Spain          7.1    3.4    3.1     8.6    7.0     29.2      5.7      5.9      7.2
Sweden         9.9    7.8    3.5    24.7    7.5     19.5      3.7      1.4      2.0
Switzerland   13.1   10.1    3.1    23.8    2.3     25.6      2.8      2.4      4.9
UK            17.4    5.7    4.7    20.6    4.3     24.3      4.7      3.4      3.3
USSR           9.3    4.6    2.1    16.6    3.0     43.6      6.4      3.4      2.9
W. Germany    11.4   12.5    4.1    18.8    3.4     18.6      5.2      1.5      3.8
Yugoslavia     4.4    5.0    1.2     9.5     .6     55.9      3.0      5.7      3.2

To  view  the  clusters,  we  plot  the  first  two  discriminant  functions  (see  Section 
8.4.1)  in  Figure  14.17.  The  first  two  discriminant  functions  show  good  separation  for 
clusters  2,  3,  and  4  but  poor  separation  for  clusters  1  and  5. 

We now select the first five observations as cluster seeds. With these seeds, the k-means clustering method produced the clusters in Table 14.9. The first two discriminant functions are plotted in Figure 14.18. Good separation of clusters is seen except for clusters 2 and 3.

We next choose as cluster seeds the five observations that are mutually farthest apart. These seeds gave rise to the clusters in Table 14.10. The first two discriminant functions are plotted in Figure 14.19. Clusters 1, 3, and 4 seem very well separated, but clusters 2 and 5 show considerable overlap.


Table 14.8. k-Means Cluster Solution Using Seeds Selected at Random

                        Distance                                 Distance
Country       Cluster   from Centroid    Country       Cluster   from Centroid
Portugal         1         1.466         Sweden           4         1.594
Spain            1         1.466         E. Germany       4         1.966
Netherlands      2         1.123         Norway           4         2.031
Austria          2         1.217         France           4         2.621
Czech.           2         1.385         Romania          5         1.066
Switzerland      2         1.657         Yugoslavia       5         1.701
Poland           2         1.914         Bulgaria         5         1.741
Ireland          3         1.334         Italy            5         2.092
UK               3         1.821         Hungary          5         2.443
Finland          3         2.261         USSR             5         2.613
Belgium          4         1.201         Albania          5         2.725
W. Germany       4         1.405         Greece           5         2.741


Figure 14.17. First two discriminant functions $z_1$ and $z_2$ for the clusters in Table 14.8.


Table 14.9. k-Means Cluster Solution Using the First Five Observations as Seeds

                        Distance                                 Distance
Country       Cluster   from Centroid    Country       Cluster   from Centroid
Albania          1          .000         Romania          4         1.415
Netherlands      2          .648         Bulgaria         4         1.587
Austria          2         1.000         Yugoslavia       4         1.784
W. Germany       2         1.087         Italy            4         1.898
Switzerland      2         1.489         Greece           4         2.450
Belgium          3         1.368         Poland           5         1.709
Sweden           3         1.462         Czech.           5         1.956
Denmark          3         1.666         USSR             5         2.218
Ireland          3         1.832         E. Germany       5         2.285
Norway           3         1.927         Spain            5         2.344
UK               3         2.076         Hungary          5         2.558
Finland          3         2.341         Portugal         5         3.859
France           3         2.629


Figure 14.18. First two discriminant functions $z_1$ and $z_2$ for the clusters in Table 14.9.


Table 14.10. k-Means Cluster Solution Using as Seeds the Five Observations Furthest Apart

                        Distance                                 Distance
Country       Cluster   from Centroid    Country       Cluster   from Centroid
Romania          1          .601         France           2         2.358
Yugoslavia       1         1.159         Poland           2         2.405
Bulgaria         1         1.435         UK               2         2.537
Albania          1         2.421         Greece           3         1.075
Hungary          1         2.540         Italy            3         1.075
Belgium          2          .956         Portugal         4         1.466
W. Germany       2         1.012         Spain            4         1.466
Netherlands      2         1.416         Norway           5         1.054
Austria          2         1.663         Sweden           5         1.191
Czech.           2         1.706         Finland          5         1.545
Switzerland      2         1.713         Denmark          5         1.708
Ireland          2         1.839         USSR             5         2.780
E. Germany       2         2.042


Figure 14.19. First two discriminant functions $z_1$ and $z_2$ for the clusters in Table 14.10.


Finally,  we  obtain  a  five-cluster  solution  from  average  linkage  and  use  the  cen¬ 
troids  of  these  clusters  as  the  new  seeds.  The  clusters  in  Table  14.11  result.  The  first 
two  discriminant  functions  are  plotted  in  Figure  14.20.  All  five  clusters  are  well  sep¬ 
arated  in  the  first  two  discriminant  functions.  These  clusters  show  some  resemblance 
to  those  in  the  principal  component  plot  given  in  Figure  14.16.  □ 


Table 14.11. k-Means Cluster Solution Using Seeds from Average Linkage

                        Distance                                 Distance
Country       Cluster   from Centroid    Country       Cluster   from Centroid
Romania          1          .970         Norway           2         2.287
Yugoslavia       1         1.182         UK               2         2.354
Bulgaria         1         1.339         France           2         2.600
Albania          1         1.970         Finland          2         2.683
Belgium          2         1.152         Greece           3         1.075
W. Germany       2         1.245         Italy            3         1.075
Netherlands      2         1.547         Portugal         4         1.466
Sweden           2         1.604         Spain            4         1.466
Ireland          2         1.744         Czech.           5         1.337
Denmark          2         1.766         Poland           5         1.579
Switzerland      2         1.831         USSR             5         1.964
Austria          2         2.037         Hungary          5         2.023
E. Germany       2         2.251


Figure 14.20. First two discriminant functions $z_1$ and $z_2$ for the clusters in Table 14.11.


14.4.1b  Other  Partitioning  Criteria 

We  now  consider  three  partitioning  methods  that  are  not  based  directly  on  the  dis¬ 
tance  from  a  point  to  the  centroid  of  a  cluster.  These  methods  are  based  on  the 
between-cluster  and  within-cluster  sum  of  squares  and  products  matrices  H  and  E 
defined  in  (6.9)  and  (6.10)  for  one-way  MANOVA.  For  well  defined  clusters,  we 
would  like  E  to  be  “small”  and  H  to  be  “large.” 

The  three  criteria  are  as  follows: 


1. Minimize $\mathrm{tr}(\mathbf{E})$.

2. Minimize $|\mathbf{E}|$.

3. Maximize $\mathrm{tr}(\mathbf{E}^{-1}\mathbf{H})$.


Using  criterion  1,  for  example,  we  would  move  an  item  with  observation  vector  y  to 
the  cluster  for  which  tr(E)  is  minimized  after  the  move. 

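We can express the first criterion in two alternative forms. By (6.10), we have

$$
\begin{aligned}
\mathrm{tr}(\mathbf{E}) &= \mathrm{tr}\Bigl[\sum_{i=1}^{g}\sum_{j=1}^{n}(\mathbf{y}_{ij}-\bar{\mathbf{y}}_{i.})(\mathbf{y}_{ij}-\bar{\mathbf{y}}_{i.})'\Bigr] && (14.28) \\
&= \sum_{i}\mathrm{tr}\Bigl[\sum_{j}(\mathbf{y}_{ij}-\bar{\mathbf{y}}_{i.})(\mathbf{y}_{ij}-\bar{\mathbf{y}}_{i.})'\Bigr] && [\text{by (2.96)}] \\
&= \sum_{i}\mathrm{tr}(\mathbf{E}_i), && (14.29)
\end{aligned}
$$

where $\mathbf{E}_i = \sum_{j}(\mathbf{y}_{ij}-\bar{\mathbf{y}}_{i.})(\mathbf{y}_{ij}-\bar{\mathbf{y}}_{i.})'$ is the sum of squares and products matrix of deviations of observations from the mean vector for the ith cluster. In (14.28) we use the notation of Section 6.1.2 for a balanced design, in which n is the number of observations in each cluster.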

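We can write $\mathrm{tr}(\mathbf{E}_i)$ in (14.29) in the form

$$
\begin{aligned}
\mathrm{tr}(\mathbf{E}_i) &= \mathrm{tr}\sum_{j}(\mathbf{y}_{ij}-\bar{\mathbf{y}}_{i.})(\mathbf{y}_{ij}-\bar{\mathbf{y}}_{i.})' \\
&= \sum_{j}\mathrm{tr}\,(\mathbf{y}_{ij}-\bar{\mathbf{y}}_{i.})(\mathbf{y}_{ij}-\bar{\mathbf{y}}_{i.})' && [\text{by (2.96)}] \\
&= \sum_{j}(\mathbf{y}_{ij}-\bar{\mathbf{y}}_{i.})'(\mathbf{y}_{ij}-\bar{\mathbf{y}}_{i.}) && [\text{by (2.97)}]. \qquad (14.30)
\end{aligned}
$$

Thus $\mathrm{tr}(\mathbf{E}_i)$ is the sum of the (squared) Euclidean distances from the individual points to the centroid of the ith cluster.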

A second form of (14.28) was given by Seber (1984, p. 277) as

$$
\mathrm{tr}(\mathbf{E}) = \frac{1}{n}\sum_{i}\sum_{k<m}(\mathbf{y}_{ik}-\mathbf{y}_{im})'(\mathbf{y}_{ik}-\mathbf{y}_{im}). \qquad (14.31)
$$

Hence minimizing $\mathrm{tr}(\mathbf{E})$ is equivalent to minimizing the sum of squared Euclidean distances between all pairs of points in a cluster.
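
The agreement between the centroid form (14.30) and the pairwise form (14.31) can be checked for a single cluster. The sketch below is only a numerical illustration; the function names `within_ss` and `pairwise_ss` are hypothetical.

```python
def within_ss(points):
    """Sum of squared Euclidean distances from each point to the centroid."""
    n = len(points)
    center = [sum(col) / n for col in zip(*points)]
    return sum(sum((x - c) ** 2 for x, c in zip(pt, center)) for pt in points)


def pairwise_ss(points):
    """(1/n) times the sum of squared distances over all pairs k < m."""
    n = len(points)
    total = 0.0
    for k in range(n):
        for m in range(k + 1, n):
            total += sum((a - b) ** 2 for a, b in zip(points[k], points[m]))
    return total / n


cluster = [(0, 0), (2, 0), (0, 2), (2, 2)]
print(within_ss(cluster), pairwise_ss(cluster))
```

Summing either quantity over the g clusters gives tr(E), so the pairwise form can be used when only a distance matrix is available and the centroids are not.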

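The second criterion, minimizing $|\mathbf{E}|$, is related to Wilks' $\Lambda = |\mathbf{E}|/|\mathbf{E} + \mathbf{H}|$ in (6.13). Because $\mathbf{E} + \mathbf{H}$ is the total sum of squares and products matrix, which is the same for every partition of the data, minimizing $|\mathbf{E}|$ is equivalent to minimizing Wilks' $\Lambda$ for the clusters.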

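Another way to look at minimizing $|\mathbf{E}|$ is to consider the effect of adding a point $\mathbf{y}$ to a cluster with centroid $\bar{\mathbf{y}}$. Let $\mathbf{u} = \mathbf{y} - \bar{\mathbf{y}}$. By (14.28), $\mathbf{E}$ is a sum of terms of the form $\mathbf{u}\mathbf{u}' = (\mathbf{y} - \bar{\mathbf{y}})(\mathbf{y} - \bar{\mathbf{y}})'$. Thus (ignoring the change in centroid with the added observation $\mathbf{y}$), the increase in $|\mathbf{E}|$ is

$$
\begin{aligned}
|\mathbf{E} + \mathbf{u}\mathbf{u}'| - |\mathbf{E}| &= |\mathbf{E}|(1 + \mathbf{u}'\mathbf{E}^{-1}\mathbf{u}) - |\mathbf{E}| && [\text{by (2.95)}] \\
&= |\mathbf{E}|\,\mathbf{u}'\mathbf{E}^{-1}\mathbf{u}.
\end{aligned}
$$

Hence, the minimum increase in $|\mathbf{E}|$ is obtained by adding $\mathbf{y}$ to the cluster for which the standardized distance $\mathbf{u}'\mathbf{E}^{-1}\mathbf{u}$ of $\mathbf{y}$ from $\bar{\mathbf{y}}$ is the smallest. By comparison, the $\mathrm{tr}(\mathbf{E})$ criterion would add $\mathbf{y}$ to the cluster for which $\mathbf{u}'\mathbf{u}$ is minimum [see (14.30)].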
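
The determinant identity used above can be verified for a small case. The following sketch assumes a 2 x 2 matrix E so that the determinant and inverse can be written out explicitly; the function names and the numerical values are hypothetical.

```python
def det2(m):
    """Determinant of a 2 x 2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def standardized_distance(e, u):
    """u' E^{-1} u for a 2 x 2 matrix E, using the explicit inverse."""
    d = det2(e)
    inv = [[e[1][1] / d, -e[0][1] / d], [-e[1][0] / d, e[0][0] / d]]
    v = [inv[0][0] * u[0] + inv[0][1] * u[1], inv[1][0] * u[0] + inv[1][1] * u[1]]
    return u[0] * v[0] + u[1] * v[1]


def det_increase(e, u):
    """|E + uu'| - |E| computed directly."""
    e_new = [[e[i][j] + u[i] * u[j] for j in range(2)] for i in range(2)]
    return det2(e_new) - det2(e)


E = [[4.0, 1.0], [1.0, 3.0]]
u = [1.0, 2.0]
print(det_increase(E, u), det2(E) * standardized_distance(E, u))
```

Both printed values agree, which illustrates why the $|\mathbf{E}|$ criterion allocates a point by standardized distance while the trace criterion uses ordinary Euclidean distance.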

The third criterion, maximizing $\mathrm{tr}(\mathbf{E}^{-1}\mathbf{H})$, is related to the Lawley-Hotelling statistic, $\mathrm{tr}(\mathbf{E}^{-1}\mathbf{H}) = \sum_{i=1}^{s}\lambda_i$ in (6.28), where $\lambda_1, \lambda_2, \ldots, \lambda_s$ are the eigenvalues of $\mathbf{E}^{-1}\mathbf{H}$ and $s = \min(p, g - 1)$. Associated with each $\lambda_i$ is the eigenvector $\mathbf{a}_i$ and the discriminant function $z_i = \mathbf{a}_i'\mathbf{y}$ (see Section 8.4). The largest eigenvalue, $\lambda_1$, and the accompanying first discriminant function, $z_1 = \mathbf{a}_1'\mathbf{y}$, have the greatest influence on $\mathrm{tr}(\mathbf{E}^{-1}\mathbf{H})$. Maximizing $\mathrm{tr}(\mathbf{E}^{-1}\mathbf{H})$ tends to produce elliptical clusters of the same size. These clusters would tend to follow a straight-line trend, especially if the first eigenvalue dominates the others. If the initial clusters or seeds are lined up in a different direction than the "true clusters," maximizing $\mathrm{tr}(\mathbf{E}^{-1}\mathbf{H})$ may not correct the error in subsequent iterations.

Since $\mathrm{tr}(\mathbf{E})$ involves only the diagonal elements, the first criterion ignores the correlations and tends to yield spherical clusters. The second criterion, minimizing $|\mathbf{E}|$, takes correlations into account and tends to produce elliptical clusters. These clusters have a tendency to be of the same shape because $\mathbf{E}/\nu_E$ is a pooled estimator of the covariance matrix. A modification that may be useful is $\prod_{i=1}^{g}|\mathbf{E}_i|$, where $\mathbf{E}_i$ is the error matrix for the ith cluster [see (14.29)].

Finally, we compare the three criteria in terms of invariance to nonsingular linear transformations of the form $\mathbf{A}\mathbf{y}_{ij} + \mathbf{b}$, where $\mathbf{A}$ is a constant nonsingular matrix and $\mathbf{b}$ is a vector of constants. The first criterion, minimizing $\mathrm{tr}(\mathbf{E})$, is not invariant to such linear transformations, whereas the other two criteria are invariant to these transformations. Therefore, minimizing $\mathrm{tr}(\mathbf{E})$ will likely give different partitions for the raw data and standardized data.


14.4.2  Other  Methods 

We  discuss  mixtures  of  distributions  in  Section  14.4.2a  and  density  estimation  in 
Section  14.4.2b. 

14.4.2a  Mixtures  of  Distributions 

In  this  method,  we  assume  the  existence  of  g  distributions  (usually  multivariate  nor¬ 
mal),  and  we  wish  to  assign  each  of  the  n  items  in  the  sample  to  the  distribution  it 
most  likely  belongs  to.  Such  an  approach  is  related  to  classification  analysis  in  Chap¬ 
ter  9.  Along  with  partitioning  in  Section  14.4.1,  this  method  has  the  property  that 
points  can  be  transferred  from  one  cluster  to  another,  but  it  requires  more  assump¬ 
tions  than  partitioning. 

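We define the density of a mixture of g distributions as the weighted average

$$
h(\mathbf{y}) = \sum_{i=1}^{g}\alpha_i f(\mathbf{y}, \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i), \qquad (14.32)
$$

where $0 \le \alpha_i \le 1$, $\sum_{i=1}^{g}\alpha_i = 1$, and $f(\mathbf{y}, \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$ is the multivariate normal distribution $N_p(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$ given in (4.2).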

Clusters could be formed in two ways. The first approach is to assign an item with observation vector $\mathbf{y}$ to the cluster $C_i$ with largest value of the estimated posterior probability

$$
\hat{P}(C_i \mid \mathbf{y}) = \frac{\hat{\alpha}_i f(\mathbf{y}, \hat{\boldsymbol{\mu}}_i, \hat{\boldsymbol{\Sigma}}_i)}{\hat{h}(\mathbf{y})} \qquad (14.33)
$$

[see Rencher (1998, Sections 6.2.4 and 6.3.1)], where $\hat{\alpha}_i$, $\hat{\boldsymbol{\mu}}_i$, and $\hat{\boldsymbol{\Sigma}}_i$ are maximum likelihood estimates and $\hat{h}(\mathbf{y})$ is given by (14.32) with estimates inserted for parameters. The posterior probability (14.33) is an estimate of the probability that an item with observation vector $\mathbf{y}$ belongs to the ith cluster, $C_i$.

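The second approach is to assign an item with observation vector $\mathbf{y}$ to the cluster with largest value of

$$
\ln\hat{\alpha}_i - \tfrac{1}{2}\ln|\hat{\boldsymbol{\Sigma}}_i| - \tfrac{1}{2}(\mathbf{y} - \hat{\boldsymbol{\mu}}_i)'\hat{\boldsymbol{\Sigma}}_i^{-1}(\mathbf{y} - \hat{\boldsymbol{\mu}}_i) \qquad (14.34)
$$

[see (9.14)]. For either of these approaches [based on (14.33) or (14.34)], we need the estimates $\hat{\alpha}_i$, $\hat{\boldsymbol{\mu}}_i$, and $\hat{\boldsymbol{\Sigma}}_i$. These estimates are obtained by maximizing the likelihood function $L = \prod_{j=1}^{n} h(\mathbf{y}_j)$, where $h(\mathbf{y}_j)$ is given by (14.32). The results are

$$
\begin{aligned}
\hat{\alpha}_i &= \frac{1}{n}\sum_{j=1}^{n}\hat{P}(C_i \mid \mathbf{y}_j), && i = 1, 2, \ldots, g - 1, \\
\hat{\boldsymbol{\mu}}_i &= \frac{1}{n\hat{\alpha}_i}\sum_{j=1}^{n}\mathbf{y}_j\hat{P}(C_i \mid \mathbf{y}_j), && i = 1, 2, \ldots, g, \\
\hat{\boldsymbol{\Sigma}}_i &= \frac{1}{n\hat{\alpha}_i}\sum_{j=1}^{n}(\mathbf{y}_j - \hat{\boldsymbol{\mu}}_i)(\mathbf{y}_j - \hat{\boldsymbol{\mu}}_i)'\hat{P}(C_i \mid \mathbf{y}_j), && i = 1, 2, \ldots, g
\end{aligned}
$$

(Everitt 1993, p. 111), where $\hat{P}(C_i \mid \mathbf{y}_j)$ is given by (14.33). These three equations must be solved iteratively. For a given value of g, we can begin with initial estimates or guesses for the parameters and adjust them by iteration (this approach is related to the EM algorithm mentioned in Section 3.11). If g is not known, we can begin with g = 1 and then successively try g = 2, g = 3, and so on, until the results are satisfactory.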
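
The iteration can be sketched for the simplest case. The code below is a minimal illustration, assuming a single variable (p = 1) so that each covariance matrix reduces to a variance; the function names, starting values, and data are hypothetical, and no safeguards against a variance collapsing to zero are included.

```python
import math


def normal_pdf(y, mu, var):
    """Univariate normal density."""
    return math.exp(-((y - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_mixture(ys, alphas, mus, variances, n_iter=50):
    """Alternate between posterior probabilities and parameter updates."""
    n, g = len(ys), len(alphas)
    post = []
    for _ in range(n_iter):
        # Posterior probability of each cluster for each item, as in (14.33).
        post = []
        for y in ys:
            weighted = [alphas[i] * normal_pdf(y, mus[i], variances[i]) for i in range(g)]
            h = sum(weighted)  # mixture density, as in (14.32)
            post.append([w / h for w in weighted])
        # Updated estimates of the mixing proportions, means, and variances.
        alphas = [sum(post[j][i] for j in range(n)) / n for i in range(g)]
        mus = [sum(ys[j] * post[j][i] for j in range(n)) / (n * alphas[i]) for i in range(g)]
        variances = [
            sum((ys[j] - mus[i]) ** 2 * post[j][i] for j in range(n)) / (n * alphas[i])
            for i in range(g)
        ]
    return alphas, mus, variances, post


ys = [-0.2, 0.0, 0.2, 9.8, 10.0, 10.2]
alphas, mus, variances, post = fit_mixture(ys, [0.5, 0.5], [1.0, 9.0], [1.0, 1.0])
clusters = [max(range(2), key=lambda i: p[i]) for p in post]
print(alphas, mus, variances, clusters)
```

The last line assigns each item to the cluster with the largest posterior probability, which is the first of the two assignment approaches described above. Computing all g proportions as averaged posteriors is equivalent to estimating g - 1 of them and obtaining the last from the constraint that they sum to 1.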

The total number of parameters to be estimated is large. There are p parameters in each $\boldsymbol{\mu}_i$, $\tfrac{1}{2}p(p + 1)$ unique parameters in each $\boldsymbol{\Sigma}_i$, and $g - 1$ values of $\alpha_i$ (the remaining $\alpha_i$ is found by $\sum_i \alpha_i = 1$), for a total of

$$
\tfrac{1}{2}g(p + 1)(p + 2) - 1 \qquad (14.35)
$$

parameters. If the sample size n is not sufficiently large to estimate all of these parameters, we could assume a common covariance matrix $\boldsymbol{\Sigma}$, which reduces the number of parameters by $\tfrac{1}{2}(g - 1)p(p + 1)$.
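
The count in (14.35) can be tallied term by term. The function below is a small sketch with a hypothetical name; the values p = 9 and g = 5 match the nine food groups and five clusters of the protein data used in this section.

```python
def n_mixture_parameters(p, g, common_covariance=False):
    """Number of free parameters in a g-component normal mixture in p dimensions."""
    n_means = g * p
    n_cov_matrices = 1 if common_covariance else g
    n_covariances = n_cov_matrices * p * (p + 1) // 2
    n_proportions = g - 1
    return n_means + n_covariances + n_proportions


full = n_mixture_parameters(9, 5)
pooled = n_mixture_parameters(9, 5, common_covariance=True)
print(full, pooled, full - pooled)
```

With only 25 observations, the gap between the two counts shows why a common covariance matrix is a natural simplification for a data set of this size.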

The  method  of  mixtures  is  invariant  to  full-rank  linear  transformations  and  is 
somewhat  robust  to  the  assumption  of  normality.  The  technique  works  better  if  the  g 
densities  are  well  separated  or  the  sample  sizes  are  large. 

Example 14.4.2a. To illustrate the clustering method based on mixtures of distributions, we use the protein consumption data of Table 14.7. Because of the small number of countries in the data set, there are not enough degrees of freedom to estimate a different covariance matrix for each cluster. Hence we assume equal covariance matrices and estimate a pooled covariance matrix $\boldsymbol{\Sigma}$. For illustration purposes, we choose g = 5, as in Example 14.4.1a.

We use the five clusters found by Ward's method to obtain initial estimates of $\alpha_i$, $\boldsymbol{\mu}_i$, and $\boldsymbol{\Sigma}$. Then the maximum likelihood equations are solved iteratively to find the following estimates:

$\hat{\alpha}_1 = .2801$, $\hat{\alpha}_2 = .3200$,

$\hat{\boldsymbol{\Sigma}}$ =

            .167    .111    .839   1.653    .929  -4.250   -.699   -.011
   .167    8.412   -.777   1.708    .137   -.157   1.245    .301   1.313
   .111    -.777   1.634   -.845   -.208    .287  -2.903   -.256   -.801
   .839    1.708   -.845   2.053    .503    .035   -.319   -.008    .032
  1.653     .137   -.208    .503   1.808

Then  assigning  each  country  to  the  cluster  for  which  it  has  the  highest  posterior 
probability  of  membership  as  in  (14.33)  yields  the  following  clusters: 


Cluster 1: Albania, Czech., Greece, Hungary, Italy, Poland, USSR
Cluster 2: Austria, Belgium, France, Ireland, Netherlands, Switzerland, UK, W. Germany
Cluster 3: Bulgaria, Romania, Yugoslavia
Cluster 4: Denmark, Finland, Norway, Sweden
Cluster 5: E. Germany, Portugal, Spain

□

14.4.2b Density Estimation

In  the  method  of  density  estimation,  or  density  searching,  we  seek  regions  of  high 
density  sometimes  called  modes.  No  assumption  is  made  about  the  form  of  the  den¬ 
sity,  as  was  done  in  Section  14.4.2a.  We  could  estimate  the  density  using  a  kernel 
function  as  in  Section  9.7.2.  Alternatively,  we  simply  attempt  to  separate  regions 
with  a  high  concentration  of  points  from  regions  with  a  low  density. 

To find regions of high density, we first choose a radius r and a value of k, the number of points in a k-nearest neighbor scheme. For each of the n points in the data, the number of points within a sphere of radius r is found. A point is called a dense point if at least k other points are contained in its sphere.

If  a  dense  point  is  more  than  a  distance  r  from  all  other  dense  points,  it  becomes 
the  nucleus  of  a  new  cluster.  If  a  dense  point  is  within  a  distance  r  from  at  least  one 
dense  point  that  belongs  to  a  cluster,  it  is  added  to  the  cluster.  If  the  dense  point  is 
within  a  distance  r  of  two  or  more  clusters,  these  clusters  are  combined.  Two  clusters 
are  also  combined  if  the  smallest  distance  between  their  dense  points  is  less  than  the 
average  of  the  2k  smallest  distances  between  the  original  n  points.  The  value  of  r 
can  be  gradually  increased  so  that  more  points  become  dense.  Another  option  is  to 
begin  with  the  specified  value  of  r  for  each  point  and  then  gradually  increase  r  until 
k  observations  are  contained  in  its  sphere. 
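
The core of the dense-point procedure can be sketched as follows. This is a simplified illustration, not a full implementation of the method above: it links dense points that lie within a distance r of each other, and it omits both the additional merging rule based on the 2k smallest distances and the gradual increase of r. Points that are not dense are labeled -1. The function name is hypothetical.

```python
import math


def dense_point_clusters(points, r, k):
    """Label each point with a cluster number, or -1 if it is not dense."""
    n = len(points)

    def d(i, j):
        return math.dist(points[i], points[j])

    # A point is dense if at least k other points lie within distance r of it.
    dense = [
        i for i in range(n)
        if sum(1 for j in range(n) if j != i and d(i, j) <= r) >= k
    ]

    labels = [-1] * n
    current = 0
    for start in dense:
        if labels[start] != -1:
            continue
        # Grow a cluster from this nucleus by chaining dense points within r.
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in dense:
                if labels[j] == -1 and d(i, j) <= r:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


pts = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11), (5, 5)]
print(dense_point_clusters(pts, r=1.5, k=2))
```

Chaining dense points in this way automatically combines clusters that share a dense point within distance r, which is the third rule in the description above.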


Example 14.4.2b. To illustrate the density estimation method, we use the protein data. For each pair of values of k and r, the value of r was allowed to increase if needed, as described above. For the following values of k and r, the number of clusters obtained is given.


k/r  1.6  1.7 

1.8  1.9  2.0 

2.7  2.8 

2  5  5 

3  3  3 

4  3  3 

5  4  4 

3  3  3 

3  3  3 

4  4  4 

3  3  3 

3  3  3 

3  3  3 

2  2  2 

2  2  2 

3  3 

2  2 

2  2 

The  five-cluster  solution  found  by  setting  r  =  1.8  and  k  =  2  is 

Cluster 1: Austria, Belgium, France, Ireland, Netherlands, Switzerland, UK, W. Germany
Cluster 2: Denmark, Finland, Norway, Sweden
Cluster 3: Albania, Bulgaria, Hungary, Romania, Yugoslavia
Cluster 4: Czech., E. Germany, Poland, USSR
Cluster 5: Greece, Italy, Portugal, Spain

This  partitioning  into  five  clusters  is  perhaps  more  reasonable  than  that  found 
in  Example  14.4.2a.  The  first  two  discriminant  functions  for  these  five  clusters  are 
plotted  in  Figure  14.21.  □ 
Figure 14.21. First two discriminant functions for the clusters found in Example 14.4.2b.


14.5  CHOOSING  THE  NUMBER  OF  CLUSTERS 

In  hierarchical  clustering,  we  can  select  g  clusters  from  the  dendrogram  by  cutting 
across  the  branches  at  a  given  level  of  the  distance  measure  used  by  one  of  the  axes. 
This  is  illustrated  in  Figure  14.22,  which  is  the  dendrogram  for  the  average  linkage 
method  (Section  14.3.4)  applied  to  the  city  crime  data  in  Table  14.1  (see  Figure 
14.16). Cutting the dendrogram at a level of 700 yields two clusters. Cutting it at 535
gives three clusters.

We wish to determine the value of g that provides the best fit to the data. One
approach  is  to  look  for  large  changes  in  distances  at  which  clusters  are  formed. 
For  example,  in  Figure  14.22,  the  largest  change  in  levels  occurs  in  going  from  two 
clusters  to  a  single  cluster.  The  change  in  distance  between  the  two-cluster  solution 
and  the  three-cluster  solution  is  82  units  squared.  The  difference  between  the  three- 
cluster  solution  and  the  four-cluster  solution  is  73  units  squared,  and  the  change 
between  the  four-  and  five-cluster  solutions  is  only  26  units  squared.  In  this  case  we 
would  choose  two  clusters. 

A formalization of this procedure was proposed by Mojena (1977): choose the
number of groups given by the first stage in the dendrogram at which

$$\alpha_j > \bar{\alpha} + k s_\alpha, \quad j = 1, 2, \ldots, n,$$  (14.36)

where $\alpha_1, \alpha_2, \ldots, \alpha_n$ are the distance values for stages with $n, n-1, \ldots, 1$ clusters,
$\bar{\alpha}$ and $s_\alpha$ are the mean and standard deviation of the $\alpha$'s, and k is a constant. Mojena
Figure 14.22. Cutting the dendrogram to choose the number of clusters. (Dendrogram of the 16 cities; horizontal axis: average distance between clusters, from 1000 to 0.)

(1977)  suggested  using  a  value  of  k  in  the  range  2.75  to  3.5,  but  Milligan  and  Cooper 
(1985)  recommended  k  =  1.25,  based  on  a  simulation  study. 
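As a rough illustration of the rule in (14.36), the following Python sketch computes the threshold and reports the first stage whose distance value exceeds it. The function name `mojena_stage` and the merge distances are hypothetical, and the sample standard deviation (divisor n - 1) is an assumption, since the text does not specify the divisor. How the flagged stage maps to a number of clusters depends on the indexing convention for the distance values, so the sketch returns only the stage.

```python
import statistics

def mojena_stage(alpha, k=1.25):
    """Return (stage, threshold): the first 1-based stage j with
    alpha_j > mean + k * sd, or (None, threshold) if no stage qualifies."""
    threshold = statistics.mean(alpha) + k * statistics.stdev(alpha)
    for j, a in enumerate(alpha, start=1):
        if a > threshold:
            return j, threshold
    return None, threshold

# Hypothetical distance values for stages with 6, 5, ..., 1 clusters.
alpha = [0.0, 1.0, 1.2, 1.5, 2.0, 10.0]
stage_small_k, threshold_small_k = mojena_stage(alpha, k=1.25)
stage_large_k, threshold_large_k = mojena_stage(alpha, k=3.5)
```

With these made-up values, k = 1.25 flags the final merge, while k = 3.5 sets the threshold above every distance value and flags nothing, which shows how sensitive the rule can be to the choice of k.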

An index that can be used with either hierarchical or partitioning methods is

$$c = \frac{\operatorname{tr}(\mathbf{H})/(g-1)}{\operatorname{tr}(\mathbf{E})/(n-g)}.$$  (14.37)

The value of g that maximizes c is chosen. A related approach is to choose the value
of g that minimizes

$$d = g^2\,|\mathbf{E}|.$$  (14.38)
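The index in (14.37) needs only the traces of H and E, which are sums of squared distances. The sketch below assumes the usual between-cluster and within-cluster definitions, tr(H) as the size-weighted squared distances of cluster centroids from the grand mean and tr(E) as the squared distances of observations from their own centroids. The function name `trace_index` and the data are hypothetical.

```python
def trace_index(points, labels):
    """c = [tr(H)/(g-1)] / [tr(E)/(n-g)] for a given partition."""
    n, p = len(points), len(points[0])
    groups = {}
    for x, lab in zip(points, labels):
        groups.setdefault(lab, []).append(x)
    g = len(groups)
    grand = [sum(x[j] for x in points) / n for j in range(p)]
    tr_h = 0.0   # between-cluster sum of squares
    tr_e = 0.0   # within-cluster sum of squares
    for members in groups.values():
        m = len(members)
        mean = [sum(x[j] for x in members) / m for j in range(p)]
        tr_h += m * sum((mean[j] - grand[j]) ** 2 for j in range(p))
        tr_e += sum((x[j] - mean[j]) ** 2 for x in members for j in range(p))
    return (tr_h / (g - 1)) / (tr_e / (n - g))

# Hypothetical data: two well-separated pairs of points.
pts = [(0, 0), (0, 2), (10, 0), (10, 2)]
c_good = trace_index(pts, [0, 0, 1, 1])   # partition that follows the gap
c_bad = trace_index(pts, [0, 1, 0, 1])    # partition that ignores the gap
```

The partition that follows the gap gives a much larger value of c than the one that cuts across it, which is the behavior the criterion relies on when comparing candidate values of g.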

To compare two cluster solutions with $g_1$ and $g_2$ clusters, where $g_2 > g_1$, we can
use the test statistic


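$$F = \frac{[\operatorname{tr}(\mathbf{E}_1) - \operatorname{tr}(\mathbf{E}_2)]/\operatorname{tr}(\mathbf{E}_2)}{[(n-g_1)/(n-g_2)]\,(g_2/g_1)^{2/p} - 1},$$  (14.39)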


which has an approximate F-distribution with $p(g_2 - g_1)$ and $p(n - g_2)$ degrees of
freedom [Beale (1969)]. The matrices $\mathbf{E}_1$ and $\mathbf{E}_2$ are within-cluster sums of squares
and products matrices corresponding to $g_1$ and $g_2$. The hypothesis is that the cluster
solutions with $g_1$ and $g_2$ clusters are equally valid, and rejection implies that the
cluster solution with $g_2$ clusters is better than the solution with $g_1$ clusters ($g_2 > g_1$).
The F-approximation in (14.39) may not be sufficiently accurate to justify the use of
p-values.


14.6  CLUSTER  VALIDITY 

To  check  the  validity  of  a  cluster  solution,  it  may  be  possible  to  test  the  hypothesis 
that  there  are  no  clusters  or  groups  in  the  population  from  which  the  sample  at  hand 
was  taken.  For  example,  the  hypothesis  could  be  that  the  population  represents  a 
single  unimodal  distribution  such  as  the  multivariate  normal,  or  that  the  observations 
arose  from  a  uniform  distribution.  Formal  tests  of  hypotheses  of  this  type  concerning 
cluster  validity  are  reviewed  by  Gordon  (1999,  Section  7.2). 

A  cross-validation  approach  can  also  be  used  to  check  the  validity  or  stability  of  a 
clustering result. The data are randomly divided into two subsets, say A and B, and
the  cluster  analysis  is  carried  out  separately  on  each  of  A  and  B.  The  results  should 
be  similar  if  the  clusters  are  valid.  An  alternative  approach  is  the  following  (Gordon 
1999,  Section  7.1;  Milligan  1996): 

1.  Use  some  clustering  method  to  partition  subset  A  into  g  clusters. 

2.  Partition  subset  B  into  g  clusters  in  two  ways: 

(a)  Assign  each  item  in  B  to  the  cluster  in  A  that  it  is  closest  to  by  using,  for 
example,  the  distance  to  cluster  centroids. 

(b)  Use  the  same  clustering  method  on  B  that  was  used  on  A. 

3.  Compare  the  results  of  (a)  and  (b)  in  step  2. 
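Steps 2(a) and 3 of this procedure can be sketched as follows. The centroid-based assignment follows step 2(a); the agreement measure used for step 3 is the Rand index (the proportion of pairs of items on which the two partitions agree), which is an assumption here, since the text does not name a particular comparison measure. The function names and the data are hypothetical.

```python
import math
from itertools import combinations

def assign_to_centroids(items, centroids):
    """Step 2(a): label each item with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: math.dist(x, centroids[c]))
        for x in items
    ]

def rand_index(labels_a, labels_b):
    """Proportion of item pairs treated alike (together or apart) by both partitions."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Hypothetical centroids from clustering subset A, and items in subset B.
centroids_from_a = [(0.0, 0.0), (10.0, 10.0)]
subset_b = [(1, 1), (9, 9), (0, 2), (11, 10)]
labels_2a = assign_to_centroids(subset_b, centroids_from_a)

# Hypothetical labels from clustering B directly, as in step 2(b);
# cluster ids are arbitrary, so only the grouping matters.
labels_2b = [1, 0, 1, 0]
agreement = rand_index(labels_2a, labels_2b)
```

Because the Rand index compares pairs rather than label values, it is unaffected by the arbitrary numbering of clusters in the two solutions.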
14.7  CLUSTERING  VARIABLES

In some cases, it may be of interest to cluster the p variables rather than the n
observations. For a similarity measure between each pair of variables, we would usually
use the correlation. Since most clustering methods use dissimilarities (such as
distances), we need to convert the correlation matrix $\mathbf{R} = (r_{ij})$ to a dissimilarity matrix.
This can conveniently be done by replacing each $r_{ij}$ by $1 - |r_{ij}|$ or $1 - r_{ij}^2$. Using the
resulting dissimilarity matrix, we can apply a clustering method such as a hierarchical
technique to cluster the variables.
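The conversion from correlations to dissimilarities can be sketched directly. The function names and the small data set below are hypothetical; the sketch computes Pearson correlations between columns and applies either of the two transformations mentioned above.

```python
import math

def correlation(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def variable_dissimilarity(columns, squared=True):
    """Dissimilarity matrix with entries 1 - r^2 (default) or 1 - |r|."""
    p = len(columns)
    d = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1, p):
            r = correlation(columns[i], columns[j])
            d[i][j] = d[j][i] = 1 - r * r if squared else 1 - abs(r)
    return d

# Hypothetical variables: the second and third are perfectly (positively and
# negatively) correlated with the first; the fourth is uncorrelated with it.
cols = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1], [1, -1, -1, 1]]
diss = variable_dissimilarity(cols)
```

Both transformations treat a strong negative correlation as a small dissimilarity, so variables that move in opposite directions are clustered together.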


Figure 14.23. Dendrogram for clustering the variables of Table 14.1 using average linkage
(see Example 14.7). (Variables, top to bottom: larceny, burglary, auto theft, robbery, assault, rape, murder; horizontal axis: average distance between clusters.)
Figure 14.24. Dendrogram for clustering the variables of Table 14.1 using Ward's method
(see Example 14.7). (Variables, top to bottom: larceny, burglary, auto theft, robbery, assault, rape, murder; horizontal axis: between clusters sum of squares.)


Clustering  of  variables  can  sometimes  be  done  successfully  with  factor  analysis, 
which  groups  the  variables  corresponding  to  each  factor;  see  Sections  13.1  and  13.5. 


Example 14.7. We illustrate clustering of variables using the city crime data in Table
14.1. We first calculate the correlation matrix $\mathbf{R} = (r_{ij})$ and then transform R to a
dissimilarity matrix $\mathbf{D} = (1 - r_{ij}^2)$. The variables are then clustered using both average
linkage and Ward's clustering methods, and the dendrograms are given in Figures
14.23 and 14.24, respectively. Both clustering methods yield the same solution.
Table 14.12. Rotated Factor Loadings for City Crime Data

Variables     Factor 1   Factor 2   Factor 3
Murder         -.063       .734       .142
Rape            .504       .659       .160
Robbery         .133       .355       .726
Assault         .298       .740       .398
Burglary        .764       .221       .181
Larceny         .847      -.014       .244
Auto theft      .240       .097       .584

We  next  carry  out  a  factor  analysis  of  the  data  and  compare  the  resulting  groups 
of  variables  with  the  clusters  obtained  with  the  average  linkage  and  Ward’s  methods. 
The  factor  loadings  are  estimated  using  the  principal  factor  method  (Section  13.3.2) 
with  squared  multiple  correlations  as  initial  communality  estimates,  and  the  loadings 
are  then  rotated  with  a  varimax  rotation  (Section  13.5.2b).  The  rotated  factor  pattern 
is  given  in  Table  14.12.  The  highest  loading  in  each  row  is  bolded.  The  first  factor 
deals  with  crimes  associated  with  the  home.  The  second  factor  involves  crimes  that 
are  violent  in  nature.  The  third  factor  consists  of  crimes  of  theft  outside  the  home. 
Note that the three-cluster solutions found by both average linkage and Ward's
methods are identical to the grouping of variables in the factor analysis solution, namely,
(1) murder, rape, and assault, (2) robbery and auto theft, and (3) burglary and larceny.
Since all three methods agree, we have some confidence in the validity of the
solution. □


PROBLEMS 

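14.1  Show that $d^2(\mathbf{x}, \mathbf{y}) = \sum_{j=1}^{p}(x_j - y_j)^2$ from (14.2) is equal to (14.5),
$d^2(\mathbf{x}, \mathbf{y}) = (v_x - v_y)^2 + p(\bar{x} - \bar{y})^2 + 2v_xv_y(1 - r_{xy})$, where $v_x^2 = \sum_{j=1}^{p}(x_j - \bar{x})^2$,
$\bar{x} = \sum_{j=1}^{p} x_j/p$, and $r_{xy}$ is defined in (14.6).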

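14.2  (a) Show that $I_{AB} = n_A(\bar{\mathbf{y}}_A - \bar{\mathbf{y}}_{AB})'(\bar{\mathbf{y}}_A - \bar{\mathbf{y}}_{AB}) + n_B(\bar{\mathbf{y}}_B - \bar{\mathbf{y}}_{AB})'(\bar{\mathbf{y}}_B - \bar{\mathbf{y}}_{AB})$
as in (14.18).

(b) Show that (14.18) is equal to (14.19); that is,

$$n_A(\bar{\mathbf{y}}_A - \bar{\mathbf{y}}_{AB})'(\bar{\mathbf{y}}_A - \bar{\mathbf{y}}_{AB}) + n_B(\bar{\mathbf{y}}_B - \bar{\mathbf{y}}_{AB})'(\bar{\mathbf{y}}_B - \bar{\mathbf{y}}_{AB}) = \frac{n_A n_B}{n_A + n_B}(\bar{\mathbf{y}}_A - \bar{\mathbf{y}}_B)'(\bar{\mathbf{y}}_A - \bar{\mathbf{y}}_B).$$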


14.3  Using  the  hints  provided  in  each  case,  show  that  the  parameter  values  for 
(14.20)  in  Table  14.2  produce  appropriate  distances  for  the  following  cluster 
methods. 
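(a) Complete linkage. Use an approach analogous to that in Section 14.3.8
for the single linkage method.

(b) Average linkage. Write (14.20) in terms of parameter values for average
linkage in Table 14.2. Then use (14.9).

(c) Centroid method. Show that

$$(\bar{\mathbf{y}}_C - \bar{\mathbf{y}}_{AB})'(\bar{\mathbf{y}}_C - \bar{\mathbf{y}}_{AB}) = \frac{n_A}{n_A + n_B}(\bar{\mathbf{y}}_C - \bar{\mathbf{y}}_A)'(\bar{\mathbf{y}}_C - \bar{\mathbf{y}}_A) + \frac{n_B}{n_A + n_B}(\bar{\mathbf{y}}_C - \bar{\mathbf{y}}_B)'(\bar{\mathbf{y}}_C - \bar{\mathbf{y}}_B) - \frac{n_A n_B}{(n_A + n_B)^2}(\bar{\mathbf{y}}_A - \bar{\mathbf{y}}_B)'(\bar{\mathbf{y}}_A - \bar{\mathbf{y}}_B),$$  (14.40)

where $\bar{\mathbf{y}}_{AB} = (n_A\bar{\mathbf{y}}_A + n_B\bar{\mathbf{y}}_B)/(n_A + n_B)$.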

(d)  Median  method.  Use  nA  =  nB  in  (14.12)  and  (14.40)  [see  part  (c)]. 

(e) Ward's method. Show that

$$I_{C(AB)} = \frac{n_A + n_C}{n_A + n_B + n_C}\,I_{AC} + \frac{n_B + n_C}{n_A + n_B + n_C}\,I_{BC} - \frac{n_C}{n_A + n_B + n_C}\,I_{AB},$$

where $I_{AB}$ is defined in (14.17).

14.4  Show that for all methods in Table 14.2 for which $\gamma = 0$, we have $D(C, AB) > (\alpha_A + \alpha_B + \beta)D(A, B)$ as in (14.26).

14.5  Verify the statement in the last paragraph of Section 14.4.1b, namely, that the
first criterion in Section 14.4.1b is not invariant to nonsingular linear
transformations $\mathbf{z}_{ij} = \mathbf{A}\mathbf{y}_{ij} + \mathbf{b}$, where A is a p x p nonsingular matrix, and that
the other two criteria are invariant to such transformations. Use the following
approach:

(a) Show that $\mathbf{H}_z = \mathbf{A}\mathbf{H}_y\mathbf{A}'$ and $\mathbf{E}_z = \mathbf{A}\mathbf{E}_y\mathbf{A}'$.

(b) Show that minimizing tr(E) is not invariant.

(c) Show that minimizing |E| is invariant.

(d) Show that maximizing $\operatorname{tr}(\mathbf{E}^{-1}\mathbf{H})$ is invariant.

14.6  Verify the statement in Section 14.4.2a that in $\boldsymbol{\mu}_i$, $i = 1, 2, \ldots, g$; $\boldsymbol{\Sigma}_i$,
$i = 1, 2, \ldots, g$; and $\alpha_i$, $i = 1, 2, \ldots, g - 1$; the total number of
parameters is given by $\frac{1}{2}g(p + 1)(p + 2) - 1$ as in (14.35).

14.7  Use the ramus bone data of Table 3.6. Carry out the following cluster methods
and compare to the principal component plot in Figure 12.5.

(a) Find a two-cluster solution using the single linkage method.

(b) Find a two-cluster solution using the average linkage method and
compare to the result in (a). Which seems better?

(c) Carry out a cluster analysis using the Ward's, complete linkage, centroid,
and median methods.

(d) Use the flexible beta method with $\beta = -.25$, $\beta = -.5$, and $\beta = -.75$.

14.8  Use  the  hematology  data  of  Table  4.3. 

(a) Carry out a cluster analysis using the centroid method and find the
distance between the centroids of the two-cluster solution.

(b)  Carry  out  a  cluster  analysis  using  the  average  linkage  method.  How  many 
clusters  are  indicated  in  the  dendrogram? 

(c)  Using  the  two-cluster  solution  from  part  (b),  label  observations  from  one 
cluster  as  group  1  and  the  observations  from  the  other  cluster  as  group  2. 
Calculate  and  plot  the  discriminant  function,  as  in  Example  8.2.  Do  the 
two  clusters  overlap? 

14.9  Use  all  the  variables  of  the  Seishu  data  of  Table  7.1. 

(a) Find the three-cluster solution using the single linkage, complete linkage,
average linkage, centroid, median, and Ward's methods. Which observation
appears to be an outlier? Which cluster is the same in all six solutions?

(b) Using the cluster found in part (a) to be common to all solutions as group
1 and the rest of the observations as group 2, calculate and plot the
discriminant function, as in Problem 14.8(c). Do the two clusters overlap?

14.10  Use the first 20 observations of the temperature data of Table 7.2. Standardize
the variables (columns) before doing the following:

(a) Carry out a k-means cluster analysis using as initial seeds the five
observations that are mutually farthest apart. Plot the first two discriminant
functions using the five clusters as groups.

(b)  Repeat  part  (a)  using  the  first  five  observations  as  initial  seeds. 

(c)  Repeat  part  (a)  using  as  initial  seeds  the  centroids  of  the  five-cluster 
solution  found  using  Ward’s  method.  Plot  the  dendrogram  resulting  from 
Ward's  method. 

(d)  Repeat  part  (c)  using  average  linkage  instead  of  Ward’s  method.  Compare 
the  results  with  those  in  part  (c). 

(e)  Plot  the  first  and  second  principal  components  and  the  second  and  third 
components.  Which  cluster  solutions  found  in  parts  (a)-(d)  seem  to  agree 
most  with  the  principal  component  plots? 

(f)  Repeat  parts  (a)  and  (b)  using  three  initial  seeds  instead  of  five.  How  do 
the  cluster  solutions  compare? 

(g)  Repeat  part  (c)  using  three  initial  seeds  instead  of  five.  How  does  the 
cluster  solution  compare  to  your  answer  in  part  (f)? 

14.11  Table  14.13  contains  air  pollution  data  from  41  U.S.  cities  (Sokal  and  Rohlf 

1981,  p.  619).  The  variables  are  as  follows: 
Table 14.13. Air Pollution Levels in U.S. Cities

Cities                  y1     y2     y3     y4     y5      y6    y7
Phoenix                 10   70.3    213    582    6.0    7.05    36
Little Rock             13   61.0     91    132    8.2   48.52   100
San Francisco           12   56.7    453    716    8.7   20.66    67
Denver                  17   51.9    454    515    9.0   12.95    86
Hartford                56   49.1    412    158    9.0   43.37   127
Wilmington              36   54.0     80     80    9.0   40.25   114
Washington              29   57.3    434    757    9.3   38.89   111
Jacksonville            14   68.4    136    529    8.8   54.47   116
Miami                   10   75.5    207    335    9.0   59.80   128
Atlanta                 24   61.5    368    497    9.1   48.34   115
Chicago                110   50.6   3344   3369   10.4   34.44   122
Indianapolis            28   52.3    361    746    9.7   38.74   121
Des Moines              17   49.0    104    201   11.2   30.85   103
Wichita                  8   56.6    125    277   12.7   30.58    82
Louisville              30   55.6    291    593    8.3   43.11   123
New Orleans              9   68.3    204    361    8.4   56.77   113
Baltimore               47   55.0    625    905    9.6   41.31   111
Detroit                 35   49.9   1064   1513   10.1   30.96   129
Minneapolis-St. Paul    29   43.5    699    744   10.6   25.94   137
Kansas City             14   54.5    381    507   10.0   37.00    99
St. Louis               56   55.9    775    622    9.5   35.89   105
Omaha                   14   51.5    181    347   10.9   30.18    98
Albuquerque             11   56.8     46    244    8.9    7.77    58
Albany                  46   47.6     44    116    8.8   33.36   135
Buffalo                 11   47.1    391    463   12.4   36.11   166
Cincinnati              23   54.0    462    453    7.1   39.04   132
Cleveland               65   49.7   1007    751   10.9   34.99   155
Columbus                26   51.5    266    540    8.6   37.01   134
Philadelphia            69   54.6   1692   1950    9.6   39.93   115
Pittsburgh              61   50.4    347    520    9.4   36.22   147
Providence              94   50.0    343    179   10.6   42.75   125
Memphis                 10   61.6    337    624    9.2   49.10   105
Nashville               18   59.4    275    448    7.9   46.00   119
Dallas                   9   66.2    641    844   10.9   35.94    78
Houston                 10   68.9    721   1233   10.8   48.19   103
Salt Lake City          28   51.0    137    176    8.7   15.17    89
Norfolk                 31   59.3     96    308   10.6   44.68   116
Richmond                26   57.8    197    299    7.6   42.59   115
Seattle                 29   51.1    379    531    9.4   38.79   164
Charleston              31   55.2     35     71    6.5   40.75   148
Milwaukee               16   45.7    569    717   11.8   29.07   123
y1 = SO2 content of air in micrograms per cubic meter
y2 = Average annual temperature in °F
y3 = Number of manufacturing enterprises employing 20 or more workers
y4 = Population size (1970 census) in thousands
y5 = Average annual wind speed in miles per hour
y6 = Average annual precipitation in inches
y7 = Average number of days with precipitation per year

Standardize  each  variable  to  mean  0  and  standard  deviation  1.  Carry  out  a 
cluster  analysis  using  the  density  estimation  method  with  k  equal  to  2,  3,  4,  5 
and  values  of  r  ranging  from  .2  to  2  by  increments  of  .2  for  each  value  of  k. 
What  is  the  maximum  value  of  k  that  produces  a  two-cluster  solution? 

14.12  Table  14.14  gives  the  yields  of  winter  wheat  in  each  of  the  years  1970-1973 
at  12  different  sites  in  England  (Hand  et  al.  1994,  p.  31). 

(a) Carry out a cluster analysis using the density estimation method with k =
2, 3, 4 and r = .2, .4, ..., 2.0.

(b) Plot the first two discriminant functions from the three-cluster solution
obtained with k = 2 and r = 1.

(c) Plot the first two principal components and compare with the plot in
part (b).

(d) Repeat part (b) using a two-cluster solution obtained with k = 3 and
r = 1. Which two clusters of the three-cluster solution found in part (b)
merged into one cluster?


Table 14.14. Yields of Winter Wheat (kg per unit area)

                            Year
Site                1970    1971    1972    1973
Cambridge          46.81   39.40   55.64   32.61
Cockle Park        46.49   34.07   45.06   41.02
Harpers Adams      44.03   42.03   40.32   50.23
Headley Hall       52.24   36.19   47.03   34.56
Morley             36.55   43.06   38.07   43.17
Myerscough         34.88   49.72   40.86   50.08
Rosemaund          56.14   47.67   43.48   38.99
Seale-Hayne        45.67   27.30   45.48   50.32
Sparsholt          42.97   46.87   38.78   47.49
Sutton Bonington   54.44   49.34   24.48   46.94
Terrington         54.95   52.05   50.91   39.13
Wye                48.94   48.63   31.69   59.72


CHAPTER  15 


Graphical  Procedures 


In Sections 15.1, 15.2, and 15.3, we consider three graphical techniques:
multidimensional scaling, correspondence analysis, and biplots. These methods are designed to
reduce dimensionality and portray relationships among observations or variables.


15.1  MULTIDIMENSIONAL  SCALING 

15.1.1  Introduction 

In the dimension reduction technique called multidimensional scaling, we begin with
the distances $\delta_{ij}$ between each pair of items. We wish to represent the n items in a
low-dimensional coordinate system, in which the distances $d_{ij}$ between items closely
match the original distances $\delta_{ij}$, that is,

$$d_{ij} \cong \delta_{ij} \quad \text{for all } i, j.$$

The final distances $d_{ij}$ are usually Euclidean. The original distances $\delta_{ij}$ may be actual
measured distances between observations $\mathbf{y}_i$ and $\mathbf{y}_j$ in p dimensions, such as

$$\delta_{ij} = [(\mathbf{y}_i - \mathbf{y}_j)'(\mathbf{y}_i - \mathbf{y}_j)]^{1/2}.$$  (15.1)

On the other hand, the distances $\delta_{ij}$ may be only a proximity or similarity based on
human judgment, for example, the perceived degree of similarity between all pairs
of brands of a certain type of appliance (for a discussion of similarities and
dissimilarities, see Section 14.2). The goal of multidimensional scaling is a plot that exhibits
information about how the items relate to each other or provides some other
meaningful interpretation of the data. For example, the aim may be seriation or ranking;
if the points lie close to a curve in two dimensions, then the ordering of points along
the curve is used to rank the points.

If the observation vectors $\mathbf{y}_i$, $i = 1, 2, \ldots, n$, are available and we calculate
distances using (15.1) or a similar measure, or if the original $\mathbf{y}_i$'s are not available,
but we have actual distances between items, then the process of reduction to a lower
dimensional geometric representation is called metric multidimensional scaling. If
the original distances are only similarities based on judgment, the process is called
nonmetric multidimensional scaling, and the final spatial representation preserves
only the rank order among the similarities. We consider metric scaling in Section
15.1.2 and nonmetric scaling in Section 15.1.3. For useful discussions of various
aspects of multidimensional scaling, see Davidson (1983); Gordon (1999, Sections
6.2 and 6.3); Kruskal and Wish (1978); Mardia, Kent, and Bibby (1979, Chapter
14); Seber (1984, Section 5.5); Young (1987); Jobson (1992, Section 10.3); Shepard,
Romney, and Nerlove (1972); and Romney, Shepard, and Nerlove (1972).


15.1.2  Metric  Multidimensional  Scaling 

In this section, we consider metric multidimensional scaling, which is also known as
the classical solution and as principal coordinate analysis. We begin with an n x n
distance matrix $\mathbf{D} = (\delta_{ij})$. Our goal is to find n points in k dimensions such that the
interpoint distances $d_{ij}$ in the k dimensions are approximately equal to the values of
$\delta_{ij}$ in D. Typically, we use k = 2 for plotting purposes, but k = 1 or 3 may also be
useful.

The  points  are  found  as  follows: 

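1. Construct the n x n matrix $\mathbf{A} = (a_{ij}) = (-\frac{1}{2}\delta_{ij}^2)$, where $\delta_{ij}$ is the ijth element
of D.

2. Construct the n x n matrix $\mathbf{B} = (b_{ij})$, with elements $b_{ij} = a_{ij} - \bar{a}_{i.} - \bar{a}_{.j} + \bar{a}_{..}$,
where $\bar{a}_{i.} = \sum_{j=1}^{n} a_{ij}/n$, $\bar{a}_{.j} = \sum_{i=1}^{n} a_{ij}/n$, and $\bar{a}_{..} = \sum_{ij} a_{ij}/n^2$. The matrix B
can be written as

$$\mathbf{B} = \left(\mathbf{I} - \tfrac{1}{n}\mathbf{J}\right)\mathbf{A}\left(\mathbf{I} - \tfrac{1}{n}\mathbf{J}\right).$$  (15.2)

It can be shown that there exists a q-dimensional configuration $\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_n$
with interpoint distances $d_{ij}$, where $d_{ij}^2 = (\mathbf{z}_i - \mathbf{z}_j)'(\mathbf{z}_i - \mathbf{z}_j)$, such that $d_{ij} = \delta_{ij}$ if
and only if B is positive semidefinite of rank q (Schoenberg 1935; Young and
Householder 1938; Gower 1966; Seber 1984, p. 236).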

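3. Since B is a symmetric matrix, we can use the spectral decomposition in
(2.109) to write B in the form

$$\mathbf{B} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}',$$  (15.3)

where V is the matrix of eigenvectors of B and $\boldsymbol{\Lambda}$ is the diagonal matrix of
eigenvalues of B. If B is positive semidefinite of rank q, there are q positive
eigenvalues, and the remaining n - q eigenvalues are zero. If $\boldsymbol{\Lambda}_1 = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_q)$
contains the positive eigenvalues and $\mathbf{V}_1 = (\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_q)$ contains the
corresponding eigenvectors, then we can express (15.3) in the form

$$\mathbf{B} = \mathbf{V}_1\boldsymbol{\Lambda}_1\mathbf{V}_1' = \mathbf{V}_1\boldsymbol{\Lambda}_1^{1/2}\boldsymbol{\Lambda}_1^{1/2}\mathbf{V}_1' = \mathbf{Z}\mathbf{Z}',$$

where

$$\mathbf{Z} = \mathbf{V}_1\boldsymbol{\Lambda}_1^{1/2} = \left(\sqrt{\lambda_1}\,\mathbf{v}_1, \sqrt{\lambda_2}\,\mathbf{v}_2, \ldots, \sqrt{\lambda_q}\,\mathbf{v}_q\right) = \begin{pmatrix} \mathbf{z}_1' \\ \mathbf{z}_2' \\ \vdots \\ \mathbf{z}_n' \end{pmatrix}.$$  (15.4)

4. The rows $\mathbf{z}_1', \mathbf{z}_2', \ldots, \mathbf{z}_n'$ of Z in (15.4) are the points whose interpoint distances
$d_{ij}$, where $d_{ij}^2 = (\mathbf{z}_i - \mathbf{z}_j)'(\mathbf{z}_i - \mathbf{z}_j)$, match the $\delta_{ij}$'s in the original distance matrix D, as
noted following (15.2).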

5. Since q in (15.4) will typically be too large to be of practical interest and
we would prefer a smaller dimension k for plotting, we can use the first k
eigenvalues and corresponding eigenvectors in (15.4) to obtain n points whose
interpoint distances $d_{ij}$ are approximately equal to the corresponding $\delta_{ij}$'s.

6. If B is not positive semidefinite, but its first k eigenvalues are positive and
relatively large, then these eigenvalues and associated eigenvectors may be
used in (15.4) to construct points that give reasonably good approximations to
the $\delta_{ij}$'s.
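The six steps above can be sketched with NumPy as follows. This is a minimal sketch, not a definitive implementation: the function name `classical_mds` is hypothetical, `numpy.linalg.eigh` is used for the spectral decomposition of the symmetric matrix B, and the k largest eigenvalues are assumed to be nonnegative apart from roundoff (the situation described in steps 5 and 6). The input is a 5 x 5 distance matrix for four corners of a square of side 4 together with its center.

```python
import numpy as np

def classical_mds(delta, k=2):
    """Return (Z, eigenvalues): n points in k dimensions from a distance matrix."""
    delta = np.asarray(delta, dtype=float)
    n = delta.shape[0]
    A = -0.5 * delta**2                        # step 1
    C = np.eye(n) - np.ones((n, n)) / n        # centering matrix I - (1/n)J
    B = C @ A @ C                              # step 2
    eigvals, eigvecs = np.linalg.eigh(B)       # step 3 (eigenvalues ascending)
    order = np.argsort(eigvals)[::-1][:k]      # indices of the k largest
    lam = np.maximum(eigvals[order], 0.0)      # guard against tiny negative roundoff
    Z = eigvecs[:, order] * np.sqrt(lam)       # columns sqrt(lambda_i) * v_i
    return Z, eigvals[::-1]

s = 2 * np.sqrt(2)
D = np.array([
    [0, s,     s,     s,     s    ],
    [s, 0,     4,     2 * s, 4    ],
    [s, 4,     0,     4,     2 * s],
    [s, 2 * s, 4,     0,     4    ],
    [s, 4,     2 * s, 4,     0    ],
])
Z, lam = classical_mds(D, k=2)
# Interpoint distances of the recovered configuration.
recovered = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
```

Because the two nonzero eigenvalues are equal here, the eigenvectors returned by the solver may be any orthonormal basis of that eigenspace, so the points in `Z` can appear rotated or reflected relative to another solution while giving the same interpoint distances.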


Note that the method used to obtain Z from B closely resembles principal
component analysis. Note also that the solution Z in (15.4) is not unique, since a shift in
origin or a rotation will not change the distances $d_{ij}$. For example, if C is a q x q
orthogonal matrix producing a rotation [see (2.101)], then

$$(\mathbf{C}\mathbf{z}_i - \mathbf{C}\mathbf{z}_j)'(\mathbf{C}\mathbf{z}_i - \mathbf{C}\mathbf{z}_j) = (\mathbf{z}_i - \mathbf{z}_j)'\mathbf{C}'\mathbf{C}(\mathbf{z}_i - \mathbf{z}_j) = (\mathbf{z}_i - \mathbf{z}_j)'(\mathbf{z}_i - \mathbf{z}_j) \quad \text{[see (2.103)]}.$$

Thus the rotated points $\mathbf{C}\mathbf{z}_i$ have the same interpoint distances $d_{ij}$.


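Example 15.1.2(a). To illustrate the first four steps of the above algorithm for metric
multidimensional scaling, consider the 5 x 5 distance matrix

$$\mathbf{D} = (\delta_{ij}) = \begin{pmatrix} 0 & 2\sqrt{2} & 2\sqrt{2} & 2\sqrt{2} & 2\sqrt{2} \\ 2\sqrt{2} & 0 & 4 & 4\sqrt{2} & 4 \\ 2\sqrt{2} & 4 & 0 & 4 & 4\sqrt{2} \\ 2\sqrt{2} & 4\sqrt{2} & 4 & 0 & 4 \\ 2\sqrt{2} & 4 & 4\sqrt{2} & 4 & 0 \end{pmatrix}.$$

The matrix $\mathbf{A} = (-\frac{1}{2}\delta_{ij}^2)$ in step 1 is given by

$$\mathbf{A} = -\begin{pmatrix} 0 & 4 & 4 & 4 & 4 \\ 4 & 0 & 8 & 16 & 8 \\ 4 & 8 & 0 & 8 & 16 \\ 4 & 16 & 8 & 0 & 8 \\ 4 & 8 & 16 & 8 & 0 \end{pmatrix}.$$

For the means, we have $\bar{a}_{1.} = \bar{a}_{.1} = -16/5$, $\bar{a}_{i.} = \bar{a}_{.i} = -36/5$, $i = 2, \ldots, 5$,
$\bar{a}_{..} = -32/5$. With n = 5, the matrix B in step 2 is given by

$$\mathbf{B} = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 \\ 0 & 8 & 0 & -8 & 0 \\ 0 & 0 & 8 & 0 & -8 \\ 0 & -8 & 0 & 8 & 0 \\ 0 & 0 & -8 & 0 & 8 \end{pmatrix}.$$

The rank of B is clearly 2. For step 3, the (nonzero) eigenvalues and corresponding
eigenvectors of B are given by $\lambda_1 = 16$, $\lambda_2 = 16$,

$$\mathbf{v}_1 = \frac{1}{\sqrt{2}}\begin{pmatrix} 0 \\ 1 \\ 0 \\ -1 \\ 0 \end{pmatrix}, \qquad \mathbf{v}_2 = \frac{1}{\sqrt{2}}\begin{pmatrix} 0 \\ 0 \\ 1 \\ 0 \\ -1 \end{pmatrix}.$$

Then for step 3 we have, by (15.4),

$$\mathbf{Z} = \begin{pmatrix} 0 & 0 \\ 2\sqrt{2} & 0 \\ 0 & 2\sqrt{2} \\ -2\sqrt{2} & 0 \\ 0 & -2\sqrt{2} \end{pmatrix}.$$

It can be shown (step 4) that the distance matrix for these five points is D. The five
points constitute a square with each side of length 4 and a center point at the origin.
The five points (rows of Z) are plotted in Figure 15.1. □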

Example 15.1.2(b). For another example of metric multidimensional scaling, we
use airline distances between 10 U.S. cities, as given in Table 15.1 (Kruskal and Wish
1978, pp. 7-9). The points given by metric multidimensional scaling are plotted in
Figure 15.2. Notice that north and south have been reversed; the eigenvectors $\mathbf{v}_i$ in
(15.4) are normalized but are subject to multiplication by -1. □
Figure 15.1. Plot of the five points found in Example 15.1.2(a). (Horizontal axis: Dimension 1.)


15.1.3  Nonmetric  Multidimensional  Scaling 

Suppose the $m = n(n - 1)/2$ dissimilarities $\delta_{ij}$ cannot be measured as in (15.1) but
can be ranked in order,

$$\delta_{r_1s_1} < \delta_{r_2s_2} < \cdots < \delta_{r_ms_m},$$  (15.5)


Table 15.1. Airline Distances between Ten U.S. Cities

City      1      2      3      4      5      6      7      8      9     10
 1        0    587   1212    701   1936    604    748   2139   2182    543
 2      587      0    920    940   1745   1188    713   1858   1737    597
 3     1212    920      0    879    831   1726   1631    949   1021   1494
 4      701    940    879      0   1374    968   1420   1645   1891   1220
 5     1936   1745    831   1374      0   2339   2451    347    959   2300
 6      604   1188   1726    968   2339      0   1092   2594   2734    923
 7      748    713   1631   1420   2451   1092      0   2571   2408    205
 8     2139   1858    949   1645    347   2594   2571      0    678   2442
 9     2182   1737   1021   1891    959   2734   2408    678      0   2329
10      543    597   1494   1220   2300    923    205   2442   2329      0

Cities: (1) Atlanta, (2) Chicago, (3) Denver, (4) Houston, (5) Los Angeles, (6) Miami, (7) New York,
(8) San Francisco, (9) Seattle, (10) Washington, D.C.
Figure 15.2. Plot of the points found in Example 15.1.2(b). (Axis ticks run from -2000 to 2000; horizontal axis: Dimension 1.)


where $r_1s_1$ indicates the pair of items with the smallest dissimilarity and $r_ms_m$
represents the pair with greatest dissimilarity. In nonmetric multidimensional scaling,
we seek a low-dimensional representation of the points such that the rankings of the
distances

$$d_{r_1s_1} < d_{r_2s_2} < \cdots < d_{r_ms_m}$$  (15.6)

match exactly the ordering of dissimilarities in (15.5). Thus, although metric scaling
uses the magnitudes of the $\delta_{ij}$'s, nonmetric scaling is based only on the rank order of
the $\delta_{ij}$'s.

For a given set of points with distances $d_{ij}$, a plot of $d_{ij}$ versus $\delta_{ij}$ may not be
monotonic; that is, the ordering in (15.6) may not match exactly the ordering in
(15.5). A lack of monotonicity of this type is illustrated in Figure 15.3.

In Figure 15.3, the dashed line and open circles show some values of $\hat{d}_{ij}$ that
are estimated in such a way that the plot becomes monotonic. Suitable $\hat{d}_{ij}$'s can be
estimated by monotonic regression, in which we seek values of $\hat{d}_{ij}$ to minimize the
scaled sum of squared differences

$$S^2 = \frac{\sum_{i<j}(d_{ij} - \hat{d}_{ij})^2}{\sum_{i<j} d_{ij}^2}$$  (15.7)
Figure 15.3. Plot of distance d versus dissimilarity $\delta$ illustrating lack of monotonicity. The
dashed line represents best fit by monotonic regression.


subject to the constraint

$$\hat{d}_{r_1s_1} \le \hat{d}_{r_2s_2} \le \cdots \le \hat{d}_{r_ms_m},$$

where $r_1s_1, r_2s_2, \ldots, r_ms_m$ are defined as in (15.5) and (15.6) (Kruskal 1964a,
1964b). The minimum value of $S^2$ for a given dimension, k, is called the STRESS.
Note that the $\hat{d}_{ij}$'s are not distances. They are merely numbers used as a reference to
assess the monotonicity of the $d_{ij}$'s. The $\hat{d}_{ij}$'s are sometimes called disparities.
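To make the monotonic regression step concrete, the sketch below fits disparities to a set of distances and evaluates the scaled sum of squares for one fixed configuration. The pool-adjacent-violators algorithm used here is a standard way to compute a least-squares nondecreasing fit, but it is a choice made for this sketch, not a method named in the text. The function names and data are hypothetical, and the scaling denominator is taken to be the sum of squared distances.

```python
def monotone_fit(d):
    """Pool-adjacent-violators: least-squares nondecreasing fit to the sequence d."""
    blocks = []  # each block holds [sum of values, number of values]
    for value in d:
        blocks.append([value, 1])
        # Merge backward while the previous block mean exceeds the last block mean.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted

def scaled_squared_error(delta, d):
    """Return (S2, disparities), with pairs ordered by increasing dissimilarity."""
    order = sorted(range(len(delta)), key=lambda i: delta[i])
    d_sorted = [d[i] for i in order]
    d_hat = monotone_fit(d_sorted)
    numerator = sum((a - b) ** 2 for a, b in zip(d_sorted, d_hat))
    denominator = sum(a * a for a in d_sorted)
    return numerator / denominator, d_hat

# Hypothetical dissimilarities and configuration distances for five pairs;
# the second and third distances are out of order.
delta = [1, 2, 3, 4, 5]
d = [1.0, 3.0, 2.0, 4.0, 5.0]
s2, disparities = scaled_squared_error(delta, d)
```

The two out-of-order distances are pooled to their mean, so the fitted disparities are nondecreasing; a configuration whose distances are already monotone in the dissimilarities gives a value of zero.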

The  minimum  value  of  the  STRESS  over  all  possible  configurations  of  points  can 
be  found  using  the  following  algorithm. 

1. Rank the $m = n(n - 1)/2$ distances or dissimilarities $\delta_{ij}$ as in (15.5).

2. Choose a value of $k$ and an initial configuration of points in $k$ dimensions. The initial configuration could be $n$ points chosen at random from a uniform or normal distribution, $n$ evenly spaced points in $k$-dimensional space, or the metric solution obtained by treating the ordinal measurements as continuous and using the algorithm in Section 15.1.2.

3. For the initial configuration of points, find the interitem distances $d_{ij}$. Find the corresponding $\hat{d}_{ij}$'s by monotonic regression as defined above using (15.7).

4. Choose a new configuration of points whose distances $d_{ij}$ minimize $S^2$ in (15.7) with respect to the $\hat{d}_{ij}$'s found in step 3. One approach is to use an iterative gradient technique such as the method of steepest descent or the Newton-Raphson method.

5. Using monotonic regression, find new $\hat{d}_{ij}$'s for the $d_{ij}$'s found in step 4. This gives a new value of STRESS.

6. Repeat steps 4 and 5 until STRESS converges to a minimum over all possible $k$-dimensional configurations of points.

7. Using the preceding six steps, calculate the minimum STRESS for values of $k$ starting at $k = 1$ and plot these. As $k$ increases, the curve will decrease, with occasional exceptions due to round off or numerical anomalies in the search procedure for minimum STRESS. We look for a discernible bend in the plot, following which the curve is low and relatively flat. An ideal plot is shown in Figure 15.4. The curve levels off after $k = 2$, which is convenient for plotting the resulting $n$ points in 2 dimensions.

Figure 15.4. Ideal plot of minimum STRESS versus $k$.
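
Steps 1 through 6 can be sketched as follows for a single value of $k$. This is a minimal illustration under stated assumptions, not the procedure used for the examples in this chapter: it assumes NumPy, a normal random start, steepest descent with a simple step-halving rule, and pool-adjacent-violators for the monotonic regression. The names `monotone_fit`, `stress_and_grad`, `nonmetric_mds`, and the simulated dissimilarities are all hypothetical.

```python
import numpy as np

def monotone_fit(y):
    """Least-squares nondecreasing fit (pool adjacent violators)."""
    blocks = []
    for v in y:
        blocks.append([float(v), 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, n2 = blocks.pop()
            m1, n1 = blocks.pop()
            blocks.append([(m1 * n1 + m2 * n2) / (n1 + n2), n1 + n2])
    return np.concatenate([np.full(n, m) for m, n in blocks])

def stress_and_grad(X, iu, order):
    """S^2 of (15.7) and its gradient in X, with the disparities held fixed."""
    n = X.shape[0]
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=2))           # interitem distances
    d = D[iu]
    d_hat = np.empty_like(d)
    d_hat[order] = monotone_fit(d[order])          # steps 3 and 5
    A, B = np.sum((d - d_hat) ** 2), np.sum(d ** 2)
    Dhat = np.zeros((n, n))
    Dhat[iu] = d_hat
    Dhat = Dhat + Dhat.T
    W = (D - Dhat) / np.where(D > 0, D, 1.0)
    grad = 2 * (W[:, :, None] * diff).sum(axis=1) / B \
        - 2 * A * diff.sum(axis=1) / B ** 2
    return A / B, grad

def nonmetric_mds(delta, k=2, max_iter=200, seed=0):
    n = delta.shape[0]
    iu = np.triu_indices(n, 1)
    order = np.argsort(delta[iu], kind="stable")            # step 1
    X = np.random.default_rng(seed).normal(size=(n, k))     # step 2
    s, grad = stress_and_grad(X, iu, order)
    history = [s]
    for _ in range(max_iter):                               # step 6
        step = 1.0
        while step > 1e-8:                                  # step 4
            s_new, grad_new = stress_and_grad(X - step * grad, iu, order)
            if s_new < s:
                break
            step /= 2
        else:
            break                                           # no improving step
        X, s, grad = X - step * grad, s_new, grad_new
        history.append(s)
    return X, history

# Hypothetical dissimilarities: a monotone transform of planar distances.
truth = np.random.default_rng(1).uniform(size=(8, 2))
delta = np.sqrt(((truth[:, None] - truth[None]) ** 2).sum(-1)) ** 3
X, history = nonmetric_mds(delta, k=2)
print(history[0], history[-1])
```

Because only the rank order of `delta` enters through `order`, cubing the distances does not change the result. Repeating the call for several values of `k` and for several seeds gives the STRESS plot of step 7 and a check against local minima.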

There  is  a  possibility  that  the  minimum  value  of  STRESS  found  by  the  above 
seven  steps  for  a  given  value  of  k  may  be  a  local  minimum  rather  than  the  global 
minimum.  Such  an  anomaly  may  show  up  in  the  plot  of  minimum  STRESS  versus 
k.  The  possibility  of  a  local  minimum  can  be  checked  by  repeating  the  procedure, 
starting  with  a  different  initial  configuration. 

As was the case with metric scaling, the final configuration of points from a nonmetric scaling is invariant to a rotation of axes.

Example 15.1.3. The voting records for 15 congressmen from New Jersey on 19 environmental bills are given in Table 15.2 in the form of a dissimilarity matrix (Hand et al. 1994, p. 235). The congressmen are identified by party: $R_1$ for Republican 1, $D_2$ for Democrat 2, etc. Each entry shows how often the indicated congressman voted differently from each of the other 14.

Using an initial configuration of points from a multivariate normal distribution with mean vector $\boldsymbol{\mu} = \mathbf{0}$ and $\boldsymbol{\Sigma} = \mathbf{I}$, we find an “optimal” configuration of points for each of $k = 1, 2, \ldots, 5$. A plot of the STRESS is given in Figure 15.5.

From the plot of STRESS vs. number of dimensions, we see that either two or three dimensions will be sufficient. For plotting purposes, we choose two dimensions, which has a STRESS value of .113. The plot of the first two dimensions is given in Figure 15.6. It is apparent that the plot separates the Republicans from the Democrats except for Republican 6, who voted much the same as the Democrats.

Table 15.2. Dissimilarity Matrix for Voting Records of 15 Congressmen

      R1  R2  D1  D2  R3  R4  R5  D3  D4  D5  D6  R6  R7  D7  D8
R1     0   8  15  15  10   9   7  15  16  14  15  16   7  11  13
R2     8   0  17  12  13  13  12  16  17  15  16  17  13  12  16
D1    15  17   0   9  16  12  15   5   5   6   5   4  11  10   7
D2    15  12   9   0  14  12  13  10   8   8   8   6  15  10   7
R3    10  13  16  14   0   8   9  13  14  12  12  12  10  11  11
R4     9  13  12  12   8   0   7  12  11  10   9  10   6   6  10
R5     7  12  15  13   9   7   0  17  16  15  14  15  10  11  13
D3    15  16   5  10  13  12  17   0   4   5   5   3  12   7   6
D4    16  17   5   8  14  11  16   4   0   3   2   1  13   7   5
D5    14  15   6   8  12  10  15   5   3   0   1   2  11   4   6
D6    15  16   5   8  12   9  14   5   2   1   0   1  12   5   5
R6    16  17   4   6  12  10  15   3   1   2   1   0  12   6   4
R7     7  13  11  15  10   6  10  12  13  11  12  12   0   9  13
D7    11  12  10  10  11   6  11   7   7   4   5   6   9   0   9
D8    13  16   7   7  11  10  13   6   5   6   5   4  13   9   0

Figure 15.8. Plot of points found using initial points from a metric solution.


We now use a different initial configuration of points drawn from a uniform distribution. The resulting plot is given in Figure 15.7. The results are very similar, with
the exception of $R_1$ and $R_5$.

We next use a third initial configuration of points resulting from the metric solution, as described in Section 15.1.2. The resulting plot is given in Figure 15.8. All three plots are very similar, indicating a good fit. □


15.2  CORRESPONDENCE  ANALYSIS 
15.2.1  Introduction 

Correspondence  analysis  is  a  graphical  technique  for  representing  the  information 
in  a  two-way  contingency  table,  which  contains  the  counts  (frequencies)  of  items 
for  a  cross-classification  of  two  categorical  variables.  With  correspondence  analysis, 
we  construct  a  plot  that  shows  the  interaction  of  the  two  categorical  variables  along 
with  the  relationship  of  the  rows  to  each  other  and  of  the  columns  to  each  other.  In 
Sections  15.2.2-15.2.4,  we  consider  correspondence  analysis  for  ordinary  two-way 
contingency  tables.  In  Section  15.2.5  we  consider  multiple  correspondence  analysis 
for three-way and higher-order contingency tables. Useful treatments of correspondence analysis have been given by Greenacre (1984), Jobson (1992, Section 9.4), Khattree and Naik (1999, Chapter 7), Gower and Hand (1996, Chapters 4 and 9), and Benzecri (1992).

To test for significance of association of the two categorical variables in a contingency table, we could use a chi-square test or a log-linear model, both of which
represent  an  asymptotic  approach.  Since  correspondence  analysis  is  associated  with 
the  chi-square  approach,  we  will  review  it  in  Section  15.2.3.  If  a  contingency  table 
has  some  cell  frequencies  that  are  small  or  zero,  the  chi-square  approximation  is  not 
very  satisfactory.  In  this  case,  some  categories  can  be  combined  to  increase  the  cell 
frequencies.  Correspondence  analysis  may  be  useful  in  identifying  the  categories  that 
are  similar,  which  we  may  thereby  wish  to  combine. 

In  correspondence  analysis,  we  plot  a  point  for  each  row  and  a  point  for  each 
column  of  the  contingency  table.  These  points  are,  in  effect,  projections  of  the  rows 
and  columns  of  the  contingency  table  onto  a  two-dimensional  Euclidean  space.  The 
goal  is  to  preserve  as  far  as  possible  the  relationship  of  the  rows  (or  columns)  to 
each  other  in  a  two-dimensional  space.  If  two  row  points  are  close  together,  the 
profiles  of  the  two  rows  (across  the  columns)  are  similar.  Likewise,  two  column 
points  that  are  close  together  represent  columns  with  similar  profiles  across  the  rows 
(see  Section  15.2.2  for  a  definition  of  profiles).  If  a  row  point  is  close  to  a  column 
point,  this  combination  of  categories  of  the  two  variables  occurs  more  frequently 
than  would  occur  by  chance  if  the  two  variables  were  independent.  Another  output 
of  a  correspondence  analysis  is  the  inertia,  or  amount  of  information  in  each  of  the 
two  dimensions  in  the  plot  (see  Section  15.2.4). 


15.2.2  Row  and  Column  Profiles 

A contingency table with $a$ rows and $b$ columns is represented in Table 15.3. The entries $n_{ij}$ are the counts or frequencies for every two-way combination of row and column (every cell). The marginal totals are shown using the familiar dot notation: $n_{i.} = \sum_{j=1}^{b} n_{ij}$ and $n_{.j} = \sum_{i=1}^{a} n_{ij}$. The overall total frequency is denoted by $n$ instead of $n_{..}$ for simplicity: $n = \sum_{ij} n_{ij}$.

Table 15.3. Contingency Table with a Rows and b Columns

                          Columns
Rows            1         2        ...      b        Row Total
1             n_{11}    n_{12}     ...    n_{1b}      n_{1.}
2             n_{21}    n_{22}     ...    n_{2b}      n_{2.}
...
a             n_{a1}    n_{a2}     ...    n_{ab}      n_{a.}
Column Total  n_{.1}    n_{.2}     ...    n_{.b}      n

The frequencies $n_{ij}$ in a contingency table can be converted to relative frequencies $p_{ij}$ by dividing by $n$: $p_{ij} = n_{ij}/n$. The matrix of relative frequencies is called the correspondence matrix and is denoted by $\mathbf{P}$:

$$\mathbf{P} = (p_{ij}) = (n_{ij}/n). \tag{15.8}$$

In Table 15.4 we show the contingency table in Table 15.3 converted to a correspondence matrix.

Table 15.4. Correspondence Matrix of Relative Frequencies

                          Columns
Rows            1         2        ...      b        Row Total
1             p_{11}    p_{12}     ...    p_{1b}      p_{1.}
2             p_{21}    p_{22}     ...    p_{2b}      p_{2.}
...
a             p_{a1}    p_{a2}     ...    p_{ab}      p_{a.}
Column Total  p_{.1}    p_{.2}     ...    p_{.b}      1

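The last column of Table 15.4 contains the row sums $p_{i.} = \sum_{j=1}^{b} p_{ij}$. This column vector is denoted by $\mathbf{r}$ and can be obtained as

$$\mathbf{r} = \mathbf{P}\mathbf{j} = (p_{1.}, p_{2.}, \ldots, p_{a.})' = (n_{1.}/n, n_{2.}/n, \ldots, n_{a.}/n)', \tag{15.9}$$

where $\mathbf{j}$ is a $b \times 1$ vector of 1's. Similarly, the last row of Table 15.4 contains the column sums $p_{.j} = \sum_{i=1}^{a} p_{ij}$. This row vector is denoted by $\mathbf{c}'$ and can be obtained as

$$\mathbf{c}' = \mathbf{j}'\mathbf{P} = (p_{.1}, p_{.2}, \ldots, p_{.b}) = (n_{.1}/n, n_{.2}/n, \ldots, n_{.b}/n), \tag{15.10}$$

where $\mathbf{j}'$ is a $1 \times a$ vector of 1's. The elements of the vectors $\mathbf{r}$ and $\mathbf{c}$ are sometimes referred to as row and column masses. The correspondence matrix and marginal totals in Table 15.4 can be expressed as

$$\begin{pmatrix} \mathbf{P} & \mathbf{r} \\ \mathbf{c}' & 1 \end{pmatrix} = \begin{pmatrix} p_{11} & p_{12} & \cdots & p_{1b} & p_{1.} \\ p_{21} & p_{22} & \cdots & p_{2b} & p_{2.} \\ \vdots & \vdots & & \vdots & \vdots \\ p_{a1} & p_{a2} & \cdots & p_{ab} & p_{a.} \\ p_{.1} & p_{.2} & \cdots & p_{.b} & 1 \end{pmatrix}. \tag{15.11}$$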


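We now convert each row and column of $\mathbf{P}$ to a profile. The $i$th row profile $\mathbf{r}_i'$, $i = 1, 2, \ldots, a$, is defined by dividing the $i$th row of either Table 15.3 or 15.4 by its marginal total:

$$\mathbf{r}_i' = \left(\frac{p_{i1}}{p_{i.}}, \frac{p_{i2}}{p_{i.}}, \ldots, \frac{p_{ib}}{p_{i.}}\right) = \left(\frac{n_{i1}}{n_{i.}}, \frac{n_{i2}}{n_{i.}}, \ldots, \frac{n_{ib}}{n_{i.}}\right). \tag{15.12}$$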

The elements in each $\mathbf{r}_i'$ are relative frequencies, and therefore they sum to 1:

$$\mathbf{r}_i'\mathbf{j} = \sum_{j=1}^{b} \frac{n_{ij}}{n_{i.}} = \frac{n_{i.}}{n_{i.}} = 1. \tag{15.13}$$


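By defining

$$\mathbf{D}_r = \operatorname{diag}(\mathbf{r}) = \begin{pmatrix} p_{1.} & 0 & \cdots & 0 \\ 0 & p_{2.} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & p_{a.} \end{pmatrix} \tag{15.14}$$

and using (2.55), the matrix $\mathbf{R}$ of row profiles can be expressed as

$$\mathbf{R} = \mathbf{D}_r^{-1}\mathbf{P} = \begin{pmatrix} \mathbf{r}_1' \\ \mathbf{r}_2' \\ \vdots \\ \mathbf{r}_a' \end{pmatrix} = \begin{pmatrix} \dfrac{p_{11}}{p_{1.}} & \dfrac{p_{12}}{p_{1.}} & \cdots & \dfrac{p_{1b}}{p_{1.}} \\ \dfrac{p_{21}}{p_{2.}} & \dfrac{p_{22}}{p_{2.}} & \cdots & \dfrac{p_{2b}}{p_{2.}} \\ \vdots & \vdots & & \vdots \\ \dfrac{p_{a1}}{p_{a.}} & \dfrac{p_{a2}}{p_{a.}} & \cdots & \dfrac{p_{ab}}{p_{a.}} \end{pmatrix}. \tag{15.15}$$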


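Similarly, the $j$th column profile $\mathbf{c}_j$, $j = 1, 2, \ldots, b$, is defined by dividing the $j$th column of either Table 15.3 or Table 15.4 by its marginal total:

$$\mathbf{c}_j = \left(\frac{p_{1j}}{p_{.j}}, \frac{p_{2j}}{p_{.j}}, \ldots, \frac{p_{aj}}{p_{.j}}\right)' = \left(\frac{n_{1j}}{n_{.j}}, \frac{n_{2j}}{n_{.j}}, \ldots, \frac{n_{aj}}{n_{.j}}\right)'. \tag{15.16}$$

The elements in each $\mathbf{c}_j$ are relative frequencies, and therefore they sum to 1:

$$\mathbf{j}'\mathbf{c}_j = \sum_{i=1}^{a} \frac{n_{ij}}{n_{.j}} = \frac{n_{.j}}{n_{.j}} = 1. \tag{15.17}$$

By defining

$$\mathbf{D}_c = \operatorname{diag}(\mathbf{c}) = \begin{pmatrix} p_{.1} & 0 & \cdots & 0 \\ 0 & p_{.2} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & p_{.b} \end{pmatrix} \tag{15.18}$$

and using (2.56), the matrix $\mathbf{C}$ of column profiles can be expressed as

$$\mathbf{C} = \mathbf{P}\mathbf{D}_c^{-1} = (\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_b) = \begin{pmatrix} \dfrac{p_{11}}{p_{.1}} & \dfrac{p_{12}}{p_{.2}} & \cdots & \dfrac{p_{1b}}{p_{.b}} \\ \dfrac{p_{21}}{p_{.1}} & \dfrac{p_{22}}{p_{.2}} & \cdots & \dfrac{p_{2b}}{p_{.b}} \\ \vdots & \vdots & & \vdots \\ \dfrac{p_{a1}}{p_{.1}} & \dfrac{p_{a2}}{p_{.2}} & \cdots & \dfrac{p_{ab}}{p_{.b}} \end{pmatrix}. \tag{15.19}$$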

The vector $\mathbf{r}$ is defined in (15.9) as the column vector of row sums of $\mathbf{P}$. It can also be expressed as a weighted average of the column profiles:

$$\mathbf{r} = \sum_{j=1}^{b} p_{.j}\mathbf{c}_j. \tag{15.20}$$

Similarly, $\mathbf{c}'$ in (15.10) is the row vector of column sums of $\mathbf{P}$ and can be expressed as a weighted average of the row profiles:

$$\mathbf{c}' = \sum_{i=1}^{a} p_{i.}\mathbf{r}_i'. \tag{15.21}$$

Note that $\sum_{j=1}^{b} p_{.j} = \sum_{i=1}^{a} p_{i.} = 1$, or

$$\mathbf{j}'\mathbf{r} = \mathbf{c}'\mathbf{j} = 1, \tag{15.22}$$

where the first $\mathbf{j}$ is $a \times 1$ and the second is $b \times 1$. Therefore, the $p_{.j}$'s and $p_{i.}$'s serve as appropriate weights in the weighted averages (15.20) and (15.21).

Example 15.2.2. In Table 15.5 (Hand et al. 1994, p. 12) we have the number of piston ring failures in each of three legs in four compressors found in the same building (all four compressors are oriented in the same direction). We obtain the correspondence matrix in Table 15.6 by dividing each element of Table 15.5 by $n = \sum_{ij} n_{ij} = 166$.


Table 15.5. Piston Ring Failures

                         Leg
Compressor      North   Center   South   Row Total
1                 17      17       12        46
2                 11       9       13        33
3                 11       8       19        38
4                 14       7       28        49
Column Total      53      41       72       166

Table 15.6. Correspondence Matrix Obtained from Table 15.5

                         Leg
Compressor      North   Center   South   Row Total
1               .102    .102     .072      .277
2               .066    .054     .078      .199
3               .066    .048     .114      .229
4               .084    .042     .169      .295
Column Total    .319    .247     .434     1.000

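The vectors of row and column sums (marginal totals) in Table 15.6 are given by (15.9) and (15.10) as

$$\mathbf{r} = \begin{pmatrix} .277 \\ .199 \\ .229 \\ .295 \end{pmatrix}, \qquad \mathbf{c}' = (.319, .247, .434).$$

The matrix of row profiles is given by (15.15) as

$$\mathbf{R} = \mathbf{D}_r^{-1}\mathbf{P} = \begin{pmatrix} .370 & .370 & .261 \\ .333 & .273 & .394 \\ .289 & .211 & .500 \\ .286 & .143 & .571 \end{pmatrix}.$$

The matrix of column profiles is given by (15.19) as

$$\mathbf{C} = \mathbf{P}\mathbf{D}_c^{-1} = \begin{pmatrix} .321 & .415 & .167 \\ .208 & .220 & .181 \\ .208 & .195 & .264 \\ .264 & .171 & .389 \end{pmatrix}.$$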
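
The quantities in this example can be reproduced with a few array operations. The following sketch assumes NumPy; the variable names are illustrative and not part of the text. It forms $\mathbf{P}$, $\mathbf{r}$, $\mathbf{c}$, $\mathbf{R}$, and $\mathbf{C}$ from the counts in Table 15.5 and checks the weighted-average relations (15.20) and (15.21).

```python
import numpy as np

# Counts from Table 15.5 (rows: compressors 1-4; columns: north, center, south).
N = np.array([[17, 17, 12],
              [11,  9, 13],
              [11,  8, 19],
              [14,  7, 28]], dtype=float)

n = N.sum()              # overall total, 166
P = N / n                # correspondence matrix (15.8)
r = P.sum(axis=1)        # row masses, r = Pj (15.9)
c = P.sum(axis=0)        # column masses, c' = j'P (15.10)

R = P / r[:, None]       # row profiles, R = D_r^{-1} P (15.15)
C = P / c[None, :]       # column profiles, C = P D_c^{-1} (15.19)

print(np.round(R, 3))
print(np.round(C, 3))

# Weighted averages of the profiles recover the masses.
print(np.allclose(C @ c, r))   # (15.20)
print(np.allclose(r @ R, c))   # (15.21)
```

Dividing by `r[:, None]` and `c[None, :]` avoids forming the diagonal matrices $\mathbf{D}_r$ and $\mathbf{D}_c$ explicitly.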

15.2.3  Testing  Independence 

In  Section  15.2.1,  we  noted  that  the  data  in  a  contingency  table  can  be  used  to  check 
for  association  of  two  categorical  variables.  If  the  two  variables  are  denoted  by  x  and 
y,  then  the  assumption  of  independence  can  be  expressed  in  terms  of  probabilities  as 

$$P(x_i y_j) = P(x_i)P(y_j), \quad i = 1, 2, \ldots, a; \; j = 1, 2, \ldots, b, \tag{15.23}$$

where $x_i$ and $y_j$ correspond to the $i$th row and $j$th column of the contingency table. Using the notation in Table 15.4, we can estimate (15.23) as

$$p_{ij} = p_{i.}p_{.j}, \quad i = 1, 2, \ldots, a; \; j = 1, 2, \ldots, b. \tag{15.24}$$

The usual chi-square statistic for testing independence of $x$ and $y$ (comparing $p_{ij}$ with $p_{i.}p_{.j}$ for all $i$, $j$) is given by

$$\chi^2 = \sum_{i=1}^{a}\sum_{j=1}^{b} n\,\frac{(p_{ij} - p_{i.}p_{.j})^2}{p_{i.}p_{.j}}, \tag{15.25}$$

which is approximately (asymptotically) distributed as a chi-square random variable with $(a - 1)(b - 1)$ degrees of freedom. The statistic in (15.25) can also be written in terms of the frequencies $n_{ij}$ rather than the relative frequencies $p_{ij}$:

$$\chi^2 = \sum_{i=1}^{a}\sum_{j=1}^{b} \frac{\left(n_{ij} - \dfrac{n_{i.}n_{.j}}{n}\right)^2}{\dfrac{n_{i.}n_{.j}}{n}}. \tag{15.26}$$


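Two other alternative forms of (15.25) are

$$\chi^2 = \sum_{i=1}^{a} n p_{i.} \sum_{j=1}^{b} \left(\frac{p_{ij}}{p_{i.}} - p_{.j}\right)^2 \Big/ p_{.j}, \tag{15.27}$$

$$\chi^2 = \sum_{j=1}^{b} n p_{.j} \sum_{i=1}^{a} \left(\frac{p_{ij}}{p_{.j}} - p_{i.}\right)^2 \Big/ p_{i.}. \tag{15.28}$$

In vector and matrix form, (15.27) and (15.28) can be written as

$$\chi^2 = \sum_{i=1}^{a} n p_{i.}(\mathbf{r}_i - \mathbf{c})'\mathbf{D}_c^{-1}(\mathbf{r}_i - \mathbf{c}), \tag{15.29}$$

$$\chi^2 = \sum_{j=1}^{b} n p_{.j}(\mathbf{c}_j - \mathbf{r})'\mathbf{D}_r^{-1}(\mathbf{c}_j - \mathbf{r}), \tag{15.30}$$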

where $\mathbf{r}$, $\mathbf{c}$, $\mathbf{r}_i$, $\mathbf{c}_j$, $\mathbf{D}_r$, and $\mathbf{D}_c$ are defined in (15.9), (15.10), (15.12), (15.16), (15.14), and (15.18), respectively. Thus, in (15.29) we compare $\mathbf{r}_i$ to $\mathbf{c}$ for each $i$, and in (15.30) we compare $\mathbf{c}_j$ to $\mathbf{r}$ for each $j$. Either of these is equivalent to testing independence by comparing $p_{ij}$ to $p_{i.}p_{.j}$ for all $i$, $j$, since all the definitions of $\chi^2$ in (15.25)-(15.30) are equal. Thus, the following three statements of independence are equivalent (for simplicity, we express the three statements in terms of sample quantities rather than their theoretical counterparts):

1. $p_{ij} = p_{i.}p_{.j}$ for all $i$, $j$ (or $\mathbf{P} = \mathbf{r}\mathbf{c}'$).

2. All rows $\mathbf{r}_i'$ of $\mathbf{R}$ in (15.15) are equal (and equal to their weighted average, $\mathbf{c}'$).

3. All columns $\mathbf{c}_j$ of $\mathbf{C}$ in (15.19) are equal (and equal to their weighted average, $\mathbf{r}$).

Thus, if the variables $x$ and $y$ were independent, we would expect the rows of the contingency table to have similar profiles, or equivalently, the columns to have similar profiles. We can compare the row profiles to each other by comparing each row profile $\mathbf{r}_i'$ to the weighted average $\mathbf{c}'$ of the row profiles defined in (15.21). This comparison is made in the $\chi^2$ statistic (15.29). Similarly, we compare column profiles in (15.30).

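The chi-square statistic in (15.25) can be expressed in vector and matrix terms as

$$\chi^2 = n\,\operatorname{tr}[\mathbf{D}_r^{-1}(\mathbf{P} - \mathbf{r}\mathbf{c}')\mathbf{D}_c^{-1}(\mathbf{P} - \mathbf{r}\mathbf{c}')'] \tag{15.31}$$

$$= n\sum_{i=1}^{k} \lambda_i^2 \quad \text{[by (2.107)]}, \tag{15.32}$$

where $\lambda_1^2, \ldots, \lambda_k^2$ are the nonzero eigenvalues of $\mathbf{D}_r^{-1}(\mathbf{P} - \mathbf{r}\mathbf{c}')\mathbf{D}_c^{-1}(\mathbf{P} - \mathbf{r}\mathbf{c}')'$ and

$$k = \operatorname{rank}[\mathbf{D}_r^{-1}(\mathbf{P} - \mathbf{r}\mathbf{c}')\mathbf{D}_c^{-1}(\mathbf{P} - \mathbf{r}\mathbf{c}')'] = \operatorname{rank}(\mathbf{P} - \mathbf{r}\mathbf{c}'). \tag{15.33}$$

The rank of $\mathbf{P} - \mathbf{r}\mathbf{c}'$ is ordinarily $k = \min[(a - 1), (b - 1)]$. It is clear that the rank is less than $\min(a, b)$ since

$$(\mathbf{P} - \mathbf{r}\mathbf{c}')\mathbf{j} = \mathbf{P}\mathbf{j} - \mathbf{r}\mathbf{c}'\mathbf{j} = \mathbf{r} - \mathbf{r} = \mathbf{0} \tag{15.34}$$

[see (15.9) and (15.22)]. Similarly, $\mathbf{j}'(\mathbf{P} - \mathbf{r}\mathbf{c}') = \mathbf{c}' - \mathbf{c}' = \mathbf{0}'$ by (15.10) and (15.22), so that neither the rows nor the columns of $\mathbf{P} - \mathbf{r}\mathbf{c}'$ are linearly independent.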

Example 15.2.3. In order to test independence of the rows (compressors) and columns (legs) of Table 15.5 in Example 15.2.2, we perform a chi-square test. Using (15.25) or (15.26), we obtain $\chi^2 = 11.722$, with 6 degrees of freedom, for which the $p$-value is .0685. There is some evidence of lack of independence between leg and compressor. □
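
The test in this example can be checked numerically. The sketch below assumes NumPy and the Python standard library; the variable names are illustrative. It evaluates both (15.25) and (15.26) for the counts in Table 15.5. The $p$-value uses the closed-form upper tail of the chi-square distribution, which holds only when the degrees of freedom are even (here 6); for odd degrees of freedom a statistical library would be needed instead.

```python
import math
import numpy as np

# Counts from Table 15.5.
N = np.array([[17, 17, 12],
              [11,  9, 13],
              [11,  8, 19],
              [14,  7, 28]], dtype=float)
n = N.sum()

# Form (15.26): frequencies versus expected frequencies n_i. n_.j / n.
expected = np.outer(N.sum(axis=1), N.sum(axis=0)) / n
chi2_counts = ((N - expected) ** 2 / expected).sum()

# Form (15.25): relative frequencies versus p_i. p_.j, multiplied by n.
P = N / n
rc = np.outer(P.sum(axis=1), P.sum(axis=0))
chi2_props = n * ((P - rc) ** 2 / rc).sum()

df = (N.shape[0] - 1) * (N.shape[1] - 1)

# Upper-tail probability for even df = 2m:
# exp(-x/2) * sum_{i=0}^{m-1} (x/2)^i / i!
half = chi2_counts / 2
p_value = math.exp(-half) * sum(half ** i / math.factorial(i)
                                for i in range(df // 2))

print(round(chi2_counts, 3), df, round(p_value, 4))   # 11.722 6 0.0685
```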


15.2.4  Coordinates  for  Plotting  Row  and  Column  Profiles 

We  now  obtain  coordinates  of  the  row  points  and  column  points  for  the  best  two- 
dimensional  representation  of  the  data  in  a  contingency  table.  As  we  will  see,  the 
metric  for  the  row  points  and  column  points  is  the  same,  and  the  two  sets  of  points 
can  therefore  be  superimposed  on  the  same  plot. 

In multidimensional scaling in Section 15.1, we transformed the distance matrix and then factored it by a spectral decomposition to obtain coordinates for plotting. In correspondence analysis, the matrix $\mathbf{P} - \mathbf{r}\mathbf{c}'$ is not symmetric, and we therefore use a singular value decomposition to obtain coordinates.

We first scale $\mathbf{P} - \mathbf{r}\mathbf{c}'$ to obtain

$$\mathbf{Z} = \mathbf{D}_r^{-1/2}(\mathbf{P} - \mathbf{r}\mathbf{c}')\mathbf{D}_c^{-1/2}, \tag{15.35}$$

whose elements are

$$z_{ij} = \frac{p_{ij} - p_{i.}p_{.j}}{\sqrt{p_{i.}p_{.j}}}, \tag{15.36}$$

as in (15.25). The $a \times b$ matrix $\mathbf{Z}$ has rank $k = \min(a - 1, b - 1)$, the assumed rank of $\mathbf{P} - \mathbf{r}\mathbf{c}'$. We factor $\mathbf{Z}$ using the singular value decomposition (2.117):


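$$\mathbf{Z} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{V}'. \tag{15.37}$$

The columns of the $a \times k$ matrix $\mathbf{U}$ are (normalized) eigenvectors of $\mathbf{Z}\mathbf{Z}'$; the columns of the $b \times k$ matrix $\mathbf{V}$ are (normalized) eigenvectors of $\mathbf{Z}'\mathbf{Z}$; and $\boldsymbol{\Lambda} = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_k)$, where $\lambda_1^2, \lambda_2^2, \ldots, \lambda_k^2$ are the nonzero eigenvalues of $\mathbf{Z}'\mathbf{Z}$ and of $\mathbf{Z}\mathbf{Z}'$. The eigenvectors in $\mathbf{U}$ and $\mathbf{V}$ correspond to the eigenvalues $\lambda_1^2, \lambda_2^2, \ldots, \lambda_k^2$. Since the columns of $\mathbf{U}$ and $\mathbf{V}$ are orthonormal, $\mathbf{U}'\mathbf{U} = \mathbf{V}'\mathbf{V} = \mathbf{I}$. The values $\lambda_1, \lambda_2, \ldots, \lambda_k$ in $\boldsymbol{\Lambda}$ are called the singular values of $\mathbf{Z}$. Note that, by (15.35),

$$\mathbf{Z}\mathbf{Z}' = \mathbf{D}_r^{-1/2}(\mathbf{P} - \mathbf{r}\mathbf{c}')\mathbf{D}_c^{-1/2}\mathbf{D}_c^{-1/2}(\mathbf{P} - \mathbf{r}\mathbf{c}')'\mathbf{D}_r^{-1/2} = \mathbf{D}_r^{-1/2}(\mathbf{P} - \mathbf{r}\mathbf{c}')\mathbf{D}_c^{-1}(\mathbf{P} - \mathbf{r}\mathbf{c}')'\mathbf{D}_r^{-1/2}. \tag{15.38}$$

The (nonzero) eigenvalues of $\mathbf{Z}\mathbf{Z}'$ in (15.38) are the same as those of

$$\mathbf{D}_r^{-1/2}\mathbf{D}_r^{-1/2}(\mathbf{P} - \mathbf{r}\mathbf{c}')\mathbf{D}_c^{-1}(\mathbf{P} - \mathbf{r}\mathbf{c}')' \tag{15.39}$$

(see Section 2.11.5). The matrix expression in (15.39) is the same as that in (15.31). We have therefore denoted the eigenvalues as $\lambda_1^2, \lambda_2^2, \ldots, \lambda_k^2$ as in (15.32).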

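We can obtain a decomposition of $\mathbf{P} - \mathbf{r}\mathbf{c}'$ by equating the right-hand sides of (15.35) and (15.37) and solving for $\mathbf{P} - \mathbf{r}\mathbf{c}'$:

$$\mathbf{D}_r^{-1/2}(\mathbf{P} - \mathbf{r}\mathbf{c}')\mathbf{D}_c^{-1/2} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{V}',$$

$$\mathbf{P} - \mathbf{r}\mathbf{c}' = \mathbf{D}_r^{1/2}\mathbf{U}\boldsymbol{\Lambda}\mathbf{V}'\mathbf{D}_c^{1/2} = \mathbf{A}\boldsymbol{\Lambda}\mathbf{B}' = \sum_{i=1}^{k} \lambda_i\mathbf{a}_i\mathbf{b}_i', \tag{15.40}$$

where $\mathbf{A} = \mathbf{D}_r^{1/2}\mathbf{U}$, $\mathbf{B} = \mathbf{D}_c^{1/2}\mathbf{V}$, $\mathbf{a}_i$ and $\mathbf{b}_i$ are columns of $\mathbf{A}$ and $\mathbf{B}$, and $\boldsymbol{\Lambda} = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_k)$.

Since $\mathbf{U}'\mathbf{U} = \mathbf{V}'\mathbf{V} = \mathbf{I}$, $\mathbf{A}$ and $\mathbf{B}$ in (15.40) are scaled so that $\mathbf{A}'\mathbf{D}_r^{-1}\mathbf{A} = \mathbf{B}'\mathbf{D}_c^{-1}\mathbf{B} = \mathbf{I}$. With this scaling, the decomposition in (15.40) is often called the generalized singular value decomposition.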
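
In (15.40) the rows of $\mathbf{P} - \mathbf{r}\mathbf{c}'$ are expressed as linear combinations of the rows of $\mathbf{B}'$, which are the columns of $\mathbf{B} = (\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_k)$. The coordinates (coefficients) for the $i$th row of $\mathbf{P} - \mathbf{r}\mathbf{c}'$ are found in the $i$th row of $\mathbf{A}\boldsymbol{\Lambda}$. In like manner, the coordinates for the columns of $\mathbf{P} - \mathbf{r}\mathbf{c}'$ are given by the columns of $\boldsymbol{\Lambda}\mathbf{B}'$, since the columns of $\boldsymbol{\Lambda}\mathbf{B}'$ provide coefficients for the columns of $\mathbf{A} = (\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_k)$ in (15.40).

To find coordinates for the row deviations $\mathbf{r}_i' - \mathbf{c}'$ in $\mathbf{R} - \mathbf{j}\mathbf{c}'$ and the column deviations $\mathbf{c}_j - \mathbf{r}$ in $\mathbf{C} - \mathbf{r}\mathbf{j}'$, we express the two matrices as functions of $\mathbf{P} - \mathbf{r}\mathbf{c}'$ (see Problem 15.8):

$$\mathbf{R} - \mathbf{j}\mathbf{c}' = \mathbf{D}_r^{-1}(\mathbf{P} - \mathbf{r}\mathbf{c}'), \tag{15.41}$$

$$\mathbf{C} - \mathbf{r}\mathbf{j}' = (\mathbf{P} - \mathbf{r}\mathbf{c}')\mathbf{D}_c^{-1}. \tag{15.42}$$

Thus the coordinates for the row deviations in $\mathbf{R} - \mathbf{j}\mathbf{c}'$ with respect to the axes provided by $\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_k$ are given by the columns of

$$\mathbf{X} = \mathbf{D}_r^{-1}\mathbf{A}\boldsymbol{\Lambda}. \tag{15.43}$$

Similarly, the coordinates for the column deviations in $\mathbf{C} - \mathbf{r}\mathbf{j}'$ with respect to the axes $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_k$ are given by the columns of

$$\mathbf{Y} = \mathbf{D}_c^{-1}\mathbf{B}\boldsymbol{\Lambda}. \tag{15.44}$$

Therefore, to plot the coordinates for the row profile deviations $\mathbf{r}_i' - \mathbf{c}'$, $i = 1, 2, \ldots, a$, in two dimensions, we plot the rows of the first two columns of $\mathbf{X}$:

$$\mathbf{X}_1 = \begin{pmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \\ \vdots & \vdots \\ x_{a1} & x_{a2} \end{pmatrix}.$$

Similarly, to plot the coordinates for the column profile deviations $\mathbf{c}_j - \mathbf{r}$, $j = 1, 2, \ldots, b$, in two dimensions, we plot the rows of the first two columns of $\mathbf{Y}$:

$$\mathbf{Y}_1 = \begin{pmatrix} y_{11} & y_{12} \\ y_{21} & y_{22} \\ \vdots & \vdots \\ y_{b1} & y_{b2} \end{pmatrix}.$$
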
Both plots can be superimposed on the same graph because $\mathbf{A}$ and $\mathbf{B}$ in (15.40) share the same singular values $\lambda_1, \lambda_2, \ldots, \lambda_k$ in $\boldsymbol{\Lambda}$. Distances between row points and distances between column points are meaningful. For example, the distance between two row points is related to the chi-square metric implicit in (15.29). The chi-square distance between two row profiles $\mathbf{r}_i$ and $\mathbf{r}_j$ is given by

$$d_{ij}^2 = (\mathbf{r}_i - \mathbf{r}_j)'\mathbf{D}_c^{-1}(\mathbf{r}_i - \mathbf{r}_j).$$

If two row points (or two column points) are close, the two rows (or two columns) could be combined into a single category if necessary to improve the chi-square approximation.

The  distance  between  a  row  point  and  a  column  point  is  not  meaningful,  but  the 
proximity  of  a  row  point  and  a  column  point  has  meaning  as  noted  in  Section  15.2.1, 
namely,  that  these  two  categories  of  the  two  variables  occur  more  frequently  than 
would  be  expected  to  happen  by  chance  if  the  two  variables  were  independent. 

The weighted average (weighted by $p_{i.}$) of the chi-square distances $(\mathbf{r}_i - \mathbf{c})'\mathbf{D}_c^{-1}(\mathbf{r}_i - \mathbf{c})$ between the row profiles $\mathbf{r}_i$ and their mean $\mathbf{c}$ [see (15.21)] is called the total inertia. By (15.29) this can be expressed as $\chi^2/n$:

$$\text{Total inertia} = \frac{\chi^2}{n} = \sum_{i=1}^{a} p_{i.}(\mathbf{r}_i - \mathbf{c})'\mathbf{D}_c^{-1}(\mathbf{r}_i - \mathbf{c}). \tag{15.45}$$

As noted following (15.21), $\sum_i p_{i.} = 1$, and therefore the $p_{i.}$'s serve as appropriate weights.

By (15.32), we can write (15.45) as

$$\frac{\chi^2}{n} = \sum_{i=1}^{k} \lambda_i^2. \tag{15.46}$$

Therefore, the contribution of each of the first two dimensions (axes) of our plot to the total inertia in (15.45) is $\lambda_1^2/\sum_{i=1}^{k}\lambda_i^2$ and $\lambda_2^2/\sum_{i=1}^{k}\lambda_i^2$. The combined contribution of the two dimensions is

$$\frac{\lambda_1^2 + \lambda_2^2}{\sum_{i=1}^{k}\lambda_i^2}. \tag{15.47}$$

If (15.47) is large, then the points in the plane of the first two dimensions account for nearly all the variation in the data, including the associations. The total inertia in (15.45) and (15.46) can also be described in terms of the columns by using (15.30):

$$\text{Total inertia} = \frac{\chi^2}{n} = \sum_{j=1}^{b} p_{.j}(\mathbf{c}_j - \mathbf{r})'\mathbf{D}_r^{-1}(\mathbf{c}_j - \mathbf{r}) = \sum_{i=1}^{k} \lambda_i^2. \tag{15.48}$$

Since the inertia associated with the axes for columns is the same as that for rows, the row and column points can be plotted on the same axes.

Some computer programs use a singular value decomposition of $\mathbf{P}$ rather than of $\mathbf{P} - \mathbf{r}\mathbf{c}'$. The results are the same if the first singular value (which is 1) is discarded along with the first column of $\mathbf{A}$ (which is $\mathbf{r}$) and the first column of $\mathbf{B}$ (which is $\mathbf{c}$).
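
This relationship can be checked numerically on the piston ring counts of Table 15.5. The sketch below assumes NumPy and takes the decomposition of $\mathbf{P}$ to mean the singular value decomposition of the scaled matrix $\mathbf{D}_r^{-1/2}\mathbf{P}\mathbf{D}_c^{-1/2}$, by analogy with (15.35); this reading and the variable names are assumptions, not statements from the text.

```python
import numpy as np

# Counts from Table 15.5.
N = np.array([[17, 17, 12],
              [11,  9, 13],
              [11,  8, 19],
              [14,  7, 28]], dtype=float)
P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
scale = np.sqrt(np.outer(r, c))

S = P / scale                        # D_r^{-1/2} P D_c^{-1/2}
Z = (P - np.outer(r, c)) / scale     # (15.35)

U, s_P, Vt = np.linalg.svd(S, full_matrices=False)
s_Z = np.linalg.svd(Z, compute_uv=False)

first_A = np.sqrt(r) * U[:, 0]       # first column of A = D_r^{1/2} U
first_B = np.sqrt(c) * Vt[0]         # first column of B = D_c^{1/2} V

print(s_P)     # leading singular value is 1
print(s_Z)     # remaining values of s_P, followed by a value near 0
print(np.allclose(np.abs(first_A), r), np.allclose(np.abs(first_B), c))
```

The absolute values are needed because singular vectors are determined only up to sign.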

Figure 15.9. Row points (1, 2, 3, 4) and column points (center, north, south). The horizontal axis is Dimension 1.


Example 15.2.4. We continue the analysis of the piston ring data of Table 15.5. A correspondence analysis is performed and a plot of the row and column points is given in Figure 15.9. Row points do not lie near other row points and column points do not lie near other column points. However, compressor 1 seems to be closely associated with the center leg, compressor 2 with the north leg, and compressor 4 with the south leg. These associations illustrate the lack of independence between compressor and leg position.

Singular values and inertias are given in Table 15.7. Most of the variation is due to the first dimension, and the first two dimensions explain all the variation because rank($\mathbf{Z}$) $= \min(a - 1, b - 1) = \min(4 - 1, 3 - 1) = 2$, where $\mathbf{Z}$ is defined in (15.35). □


Table 15.7. Singular Values ($\lambda_i$), Inertia ($\lambda_i^2$), Chi-Square ($n\lambda_i^2$), and Percent ($\lambda_i^2/\sum_j \lambda_j^2$) for the Data in Table 15.5

Singular Value    Inertia    Chi-Square    Percent
   .26528         .07037      11.6819       99.66
   .01560         .00024        .0404         .34
Total             .07062      11.7223      100
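
The computations of Section 15.2.4 can be sketched for the piston ring data as follows. The code assumes NumPy and follows (15.35) through (15.46) directly; the variable names are illustrative, and the signs of the coordinates may differ from those of a published plot because singular vectors are determined only up to sign.

```python
import numpy as np

# Counts from Table 15.5.
N = np.array([[17, 17, 12],
              [11,  9, 13],
              [11,  8, 19],
              [14,  7, 28]], dtype=float)
n = N.sum()
P = N / n
r, c = P.sum(axis=1), P.sum(axis=0)

Z = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))    # (15.35), (15.36)
U, lam, Vt = np.linalg.svd(Z, full_matrices=False)    # (15.37)

k = min(N.shape) - 1                 # rank of Z, min(a - 1, b - 1)
U, lam, V = U[:, :k], lam[:k], Vt[:k].T

A = np.sqrt(r)[:, None] * U          # A = D_r^{1/2} U
B = np.sqrt(c)[:, None] * V          # B = D_c^{1/2} V
X = (A / r[:, None]) * lam           # row coordinates, X = D_r^{-1} A Lambda (15.43)
Y = (B / c[:, None]) * lam           # column coordinates, Y = D_c^{-1} B Lambda (15.44)

inertia = lam ** 2
print(np.round(lam, 5))                            # singular values
print(np.round(n * inertia, 4))                    # chi-square by dimension
print(np.round(100 * inertia / inertia.sum(), 2))  # percent of total inertia
print(np.round(X, 3))
print(np.round(Y, 3))
```

Plotting the rows of `X` and `Y` on the same axes gives a display of the kind shown in Figure 15.9. With all $k$ columns retained, the Euclidean distance between two rows of `X` equals the chi-square distance between the corresponding row profiles.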

15.2.5 Multiple Correspondence Analysis

Correspondence analysis of a two-way contingency table can be extended to a three-way or higher-order multiway table. By the method of multiple correspondence analysis, we obtain a two-dimensional graphical display of the information in the multiway contingency table. The method involves a correspondence analysis of an indicator matrix $\mathbf{G}$, in which there is a row for each item. Thus the number of rows of $\mathbf{G}$ is the total number of items in the sample. The number of columns of $\mathbf{G}$ is the total number of categories in all variables. The elements of $\mathbf{G}$ are 1's and 0's. In each row, an element is 1 if the item belongs in the corresponding category of the variable; otherwise, the element is 0. Thus the number of 1's in a row of $\mathbf{G}$ is the number of variables; for a four-way contingency table, for example, there would be four 1's in each row of $\mathbf{G}$.

We illustrate a four-way classification with the (contrived) data in Table 15.8. There are $n = 12$ items (people) and $p = 4$ categorical variables. The four variables and their categories are listed in Table 15.9. The indicator matrix $\mathbf{G}$ for the data in Table 15.8 is given in Table 15.10.

A correspondence analysis on $\mathbf{G}$ is equivalent to a correspondence analysis on $\mathbf{G}'\mathbf{G}$, which is called the Burt matrix. This equivalence can be justified as follows. In the singular value decomposition $\mathbf{G} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{V}'$, the matrix $\mathbf{V}$ contains eigenvectors of $\mathbf{G}'\mathbf{G}$. The same matrix $\mathbf{V}$ would be used in the spectral decomposition of $\mathbf{G}'\mathbf{G}$. Thus the columns of $\mathbf{V}$ are used in plotting coordinates for the columns of $\mathbf{G}$ or the columns of $\mathbf{G}'\mathbf{G}$. If $\mathbf{G}$ is $n \times p$ with $p < n$, then $\mathbf{G}'\mathbf{G}$ would be smaller in size than $\mathbf{G}$.


Table 15.8. A List of 12 People and Their Categories on Four Variables

Person    Gender    Age       Marital Status    Hair Color
  1       Male      Young     Single            Brown
  2       Male      Old       Single            Red
  3       Female    Middle    Married           Blond
  4       Male      Old       Single            Black
  5       Female    Middle    Married           Black
  6       Female    Middle    Single            Brown
  7       Male      Young     Married           Red
  8       Male      Old       Married           Blond
  9       Male      Middle    Single            Brown
 10       Female    Young     Married           Black
 11       Female    Old       Single            Brown
 12       Male      Young     Married           Blond


Table 15.9. The Categories for the Four Variables in Table 15.8

Variable          Levels
Gender            Male, female
Age               Young, middle-aged, old
Marital status    Single, married
Hair color        Blond, brown, black, red


Table 15.10. Indicator Matrix G for the Data in Table 15.8

Person    Gender    Age      Marital Status    Hair Color
  1       1 0       1 0 0    1 0               0 1 0 0
  2       1 0       0 0 1    1 0               0 0 0 1
  3       0 1       0 1 0    0 1               1 0 0 0
  4       1 0       0 0 1    1 0               0 0 1 0
  5       0 1       0 1 0    0 1               0 0 1 0
  6       0 1       0 1 0    1 0               0 1 0 0
  7       1 0       1 0 0    0 1               0 0 0 1
  8       1 0       0 0 1    0 1               1 0 0 0
  9       1 0       0 1 0    1 0               1 0 0 0
 10       0 1       1 0 0    0 1               0 0 1 0
 11       0 1       0 0 1    1 0               0 1 0 0
 12       1 0       1 0 0    0 1               1 0 0 0

The  Burt  matrix  G'G  has  a  square  block  on  the  diagonal  for  each  variable  and 
a  rectangular  block  off-diagonal  for  each  pair  of  variables.  Each  diagonal  block  is 
a  diagonal  matrix  showing  the  frequencies  for  the  categories  in  the  corresponding 
variable.  Each  off-diagonal  block  is  a  two-way  contingency  table  for  the  correspond¬ 
ing  pair  of  variables.  In  Table  15.11,  we  show  the  G'G  matrix  for  the  G  matrix  in 
Table  15.10. 
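The block structure of the Burt matrix can be checked on a small case. The following is a minimal sketch in plain Python, not the author's code: it codes persons 4 through 8 of Table 15.8 into indicator rows and forms G'G. The names `LEVELS`, `indicator_row`, and `burt` are illustrative assumptions, and using only five persons is a choice made to keep the sketch short.

```python
# Category levels in the column order of G (Table 15.9).
LEVELS = [
    ["male", "female"],
    ["young", "middle", "old"],
    ["single", "married"],
    ["blond", "brown", "black", "red"],
]

# Persons 4 through 8 of Table 15.8 (a subset, for illustration only).
people = [
    ("male", "old", "single", "black"),
    ("female", "middle", "married", "black"),
    ("female", "middle", "single", "brown"),
    ("male", "young", "married", "red"),
    ("male", "old", "married", "blond"),
]


def indicator_row(person):
    """Return the 0/1 row of G for one person (one 1 per variable)."""
    row = []
    for value, levels in zip(person, LEVELS):
        row.extend(1 if value == level else 0 for level in levels)
    return row


def burt(G):
    """Return G'G as a list of lists."""
    cols = len(G[0])
    return [[sum(r[j] * r[k] for r in G) for k in range(cols)]
            for j in range(cols)]


G = [indicator_row(p) for p in people]
B = burt(G)

for row in B:
    print(" ".join(f"{v:2d}" for v in row))
```

In the printed matrix, the diagonal entries are the category frequencies, the off-diagonal entries within a variable's own block are zero, and each off-diagonal block is the two-way table for a pair of variables.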

A  correspondence  analysis  of  G'G  yields  only  column  coordinates.  A  point  is 
plotted  for  each  column  of  G  (or  of  G'G).  Thus  each  point  represents  a  category 
(attribute)  of  one  of  the  variables. 


Table 15.11. Burt Matrix G'G for the Matrix G in Table 15.10

Gender    Age        Marital Status    Hair Color

7  0      3  1  3    4  3              3  1  1  2
0  5      1  3  1    2  3              1  2  2  0

3  1      4  0  0    1  3              1  1  1  1
1  3      0  4  0    2  2              2  1  1  0
3  1      0  0  4    3  1              1  1  1  1

4  2      1  2  3    6  0              1  3  1  1
3  3      3  2  1    0  6              3  0  2  1

3  1      1  2  1    1  3              4  0  0  0
1  2      1  1  1    3  0              0  3  0  0
1  2      1  1  1    1  2              0  0  3  0
2  0      1  0  1    1  1              0  0  0  2
Table 15.12. Singular Values ($\lambda_i$), Inertia ($\lambda_i^2$), and Chi-Square ($n\lambda_i^2$) for the Burt Matrix
G'G in Table 15.11

Singular Value    Inertia    Chi-Square    Percent

.68803            .47338       31.551       27.05
.67451            .45497       30.324       26.00
.51492            .26515       17.672       15.15
.50000            .25000       16.663       14.29
.41941            .17590       11.724       10.05
.33278            .11074        7.381        6.33
.14091            .01986        1.323        1.13

Total            1.75000      116.638      100.00

Distances  between  points  are  not  as  meaningful  as  in  correspondence  analysis, 
but  points  in  the  same  quadrant  or  approximate  vicinity  indicate  an  association.  If 
two  close  points  represent  attributes  of  the  same  variable,  the  two  attributes  may  be 
combined  into  a  single  attribute. 

Since  the  Burt  matrix  G'G  has  only  two-way  contingency  tables,  three-way  and 
higher-order  interactions  are  not  represented  in  the  plot.  The  various  two-way  tables 
are  analyzed  simultaneously,  however. 


Figure 15.10. Plot of points representing the 11 columns of Table 15.10 or 15.11.
[Plot not reproduced. Horizontal axis: Dimension 1. Recoverable point labels: blond, married, female, middle.]
Example 15.2.5(a). We continue the illustration in this section. A correspondence
analysis of the Burt matrix G'G in Table 15.11 yields the singular values, inertia,
and chi-square values in Table 15.12. The first two singular values account for only
53.05% of the total variation. A plot of the first two dimensions for the 11 columns
in Table 15.10 or 15.11 is given in Figure 15.10. It appears that married and blond
hair have a greater association than would be expected by chance alone. Another
association is that between female and middle age. □
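The singular values in an analysis of this kind can be explored numerically. The sketch below uses NumPy and assumes the usual definition of correspondence analysis through the matrix of standardized residuals; the function name `ca_singular_values` is illustrative, and the software used to produce Table 15.12 may scale its output differently. It applies the same computation to the indicator matrix G of Table 15.10 and to the Burt matrix G'G, so the printed values can be compared with Table 15.12.

```python
import numpy as np

# Indicator matrix G from Table 15.10 (12 persons, 11 category columns).
rows = """
1 0 1 0 0 1 0 0 1 0 0
1 0 0 0 1 1 0 0 0 0 1
0 1 0 1 0 0 1 1 0 0 0
1 0 0 0 1 1 0 0 0 1 0
0 1 0 1 0 0 1 0 0 1 0
0 1 0 1 0 1 0 0 1 0 0
1 0 1 0 0 0 1 0 0 0 1
1 0 0 0 1 0 1 1 0 0 0
1 0 0 1 0 1 0 1 0 0 0
0 1 1 0 0 0 1 0 0 1 0
0 1 0 0 1 1 0 0 1 0 0
1 0 1 0 0 0 1 1 0 0 0
"""
G = np.array([[float(x) for x in line.split()]
              for line in rows.strip().splitlines()])


def ca_singular_values(N):
    """Singular values of the standardized residuals of a nonnegative table N."""
    P = N / N.sum()                       # correspondence matrix
    r = P.sum(axis=1)                     # row masses
    c = P.sum(axis=0)                     # column masses
    expected = np.outer(r, c)
    S = (P - expected) / np.sqrt(expected)
    return np.linalg.svd(S, compute_uv=False)


sv_G = ca_singular_values(G)              # analysis of the indicator matrix
sv_B = ca_singular_values(G.T @ G)        # analysis of the Burt matrix
total_inertia_G = float(np.sum(sv_G ** 2))

print(np.round(sv_G[:7], 5))              # nonzero singular values for G
print(np.round(sv_B[:7], 5))              # these equal the squares of the line above
print(round(total_inertia_G, 5))
```

Under these definitions, the total inertia for an indicator matrix with J category columns and Q variables is J/Q - 1, which is 11/4 - 1 = 1.75 here, and the singular values obtained from G'G are the squares of those obtained from G.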

Example 15.2.5(b). Table 15.13 (Edwards and Kreiner 1983) is a five-way contingency
table of employed men between the ages of 18 and 67 who were asked
whether they themselves carried out repair work on their home, as opposed to hiring
a craftsperson to do the job. The five categorical variables are as follows:

Work of respondent: skilled, unskilled, office.

Tenure: rent, own.

Age: under 30, 31-45, over 45.

Accommodation type: apartment, house.

Response to repair question: yes, no.

A multiple correspondence analysis produced the inertia and singular values in
Table 15.14. The plot of the first two dimensions is given in Figure 15.11.


Table 15.13. Do-It-Yourself Data

                                       Accommodation Type
                                 Apartment               House
                                    Age                   Age
Work        Tenure   Response   <30   31-45   >46     <30   31-45   >46

Skilled     Rent     Yes         18     15      6      34     10      2
                     No          15     13      9      28      4      6
            Own      Yes          5      3      1      56     56     35
                     No           1      1      1      12     21      8

Unskilled   Rent     Yes         17     10     15      29      3      7
                     No          34     17     19      44     13     16
            Own      Yes          2      0      3      23     52     49
                     No           3      2      0       9     31     51

Office      Rent     Yes         30     23     21      22     13     21
                     No          25     19     40      25     16     12
            Own      Yes          8      5      1      54    191    102
                     No           4      2      2      19     76     61
Table 15.14. Singular Values ($\lambda_i$), Inertia ($\lambda_i^2$), and Chi-Square ($n\lambda_i^2$) for the Do-It-
Yourself Data in Table 15.13

Singular Value    Inertia    Chi-Square    Percent

.60707            .36853      3,446.5       26.32
.49477            .24480      2,289.4       17.49
.45591            .20785      1,943.9       14.85
.42704            .18237      1,705.5       13.03
.40516            .16415      1,535.2       11.73
.39392            .15517      1,451.2       11.08
.27771            .07713        721.3        5.51

Total            1.40000     13,092.9      100

Figure 15.11. Plot of points representing the 12 categories in Table 15.13.
[Plot not reproduced. Horizontal axis: Dimension 1.]


Unskilled  employment  has  a  high  association  with  not  doing  one’s  own  repairs. 
Doing  one’s  own  repairs  is  associated  with  owning  a  house,  age  between  31  and  45, 
and  doing  office  work.  Living  in  an  apartment  is  associated  with  renting.  □ 
15.3  BIPLOTS
15.3.1  Introduction 

A biplot is a two-dimensional representation of a data matrix Y [see (3.17)] showing
a point for each of the n observation vectors (rows of Y) along with a point for each
of the p variables (columns of Y). The prefix bi refers to the two kinds of points,
not to the dimensionality of the plot. The method presented here could, in fact, be
generalized to a three-dimensional (or higher-order) biplot. Biplots were introduced
by Gabriel (1971) and have been discussed at length by Gower and Hand (1996); see
also Khattree and Naik (2000), Jacoby (1998, Chapter 7), and Seber (1984, pp. 204-212).

If  p  =  2,  a  simple  scatter  plot,  as  in  Section  3.3,  has  both  kinds  of  information, 
namely,  a  point  for  each  observation  and  the  two  axes  representing  the  variables.  We 
can  see  at  a  glance  the  placement  of  the  points  relative  to  each  other  and  relative  to 
the  variables. 

When p > 2, we can obtain a two-dimensional plot of the observations by plotting
the first two principal components of S as in Section 12.4. We can then add
a  representation  of  the  p  variables  to  the  plot  of  principal  components  to  obtain  a 
biplot.  The  principal  component  approach  is  discussed  in  Section  15.3.2.  A  method 
based  on  the  singular  value  decomposition  is  presented  in  Section  15.3.3,  and  other 
methods  are  reviewed  in  Section  15.3.5. 


15.3.2  Principal  Component  Plots 

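A principal component is given by $z = \mathbf{a}'\mathbf{y}$, where $\mathbf{a}$ is an eigenvector of $\mathbf{S}$, the sample
covariance matrix, and $\mathbf{y}$ is a $p \times 1$ observation vector (see Section 12.2). There
are $p$ eigenvectors $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_p$, and thus there are $p$ principal components $z_1,
z_2, \ldots, z_p$ for each observation vector $\mathbf{y}_i$, $i = 1, 2, \ldots, n$. Hence (using the centered
form) the observation vectors are transformed to $z_{ij} = \mathbf{a}_j'(\mathbf{y}_i - \bar{\mathbf{y}}) = (\mathbf{y}_i - \bar{\mathbf{y}})'\mathbf{a}_j$,
$i = 1, 2, \ldots, n$; $j = 1, 2, \ldots, p$. Each $p \times 1$ observation vector $\mathbf{y}_i$ is transformed
to a $p \times 1$ vector of principal components,

$$\mathbf{z}_i' = [(\mathbf{y}_i - \bar{\mathbf{y}})'\mathbf{a}_1, (\mathbf{y}_i - \bar{\mathbf{y}})'\mathbf{a}_2, \ldots, (\mathbf{y}_i - \bar{\mathbf{y}})'\mathbf{a}_p]
= (\mathbf{y}_i - \bar{\mathbf{y}})'(\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_p) = (\mathbf{y}_i - \bar{\mathbf{y}})'\mathbf{A}, \quad i = 1, 2, \ldots, n, \tag{15.49}$$

where $\mathbf{A} = (\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_p)$ is the $p \times p$ matrix whose columns are (normalized)
eigenvectors of $\mathbf{S}$. [Note that the matrix $\mathbf{A}$ in (15.49) is the transpose of $\mathbf{A}$ in (12.3)].
With $\mathbf{Z}$ and $\mathbf{Y}_c$ defined as

$$\mathbf{Z} = \begin{pmatrix} \mathbf{z}_1' \\ \mathbf{z}_2' \\ \vdots \\ \mathbf{z}_n' \end{pmatrix}, \qquad
\mathbf{Y}_c = \begin{pmatrix} (\mathbf{y}_1 - \bar{\mathbf{y}})' \\ (\mathbf{y}_2 - \bar{\mathbf{y}})' \\ \vdots \\ (\mathbf{y}_n - \bar{\mathbf{y}})' \end{pmatrix} \tag{15.50}$$

[see (10.13)], we can express the principal components in (15.49) as

$$\mathbf{Z} = \mathbf{Y}_c\mathbf{A}. \tag{15.51}$$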

Since the eigenvectors $\mathbf{a}_j$ of the symmetric matrix $\mathbf{S}$ are mutually orthogonal (see
Section 2.11.6), $\mathbf{A} = (\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_p)$ is an orthogonal matrix and $\mathbf{A}\mathbf{A}' = \mathbf{I}$. Multiplying
(15.51) on the right by $\mathbf{A}'$, we obtain

$$\mathbf{Y}_c = \mathbf{Z}\mathbf{A}'. \tag{15.52}$$

The best two-dimensional representation of $\mathbf{Y}_c$ is given by taking the first two
columns of $\mathbf{Z}$ and the first two columns of $\mathbf{A}$. If the resulting matrices are denoted by
$\mathbf{Z}_2$ and $\mathbf{A}_2$, we have

$$\mathbf{Y}_c \cong \mathbf{Z}_2\mathbf{A}_2'. \tag{15.53}$$

The fit in (15.53) is best in a least squares sense. If the left side of (15.53) is represented
by $\mathbf{Y}_c = \mathbf{B} = (b_{ij})$ and the right side by $\mathbf{Z}_2\mathbf{A}_2' = \mathbf{C} = (c_{ij})$, then
$\sum_{i=1}^{n}\sum_{j=1}^{p}(b_{ij} - c_{ij})^2$ is minimized (Seber 1984, p. 206).

The coordinates for the $n$ observations are the rows of $\mathbf{Z}_2$, and the coordinates for
the $p$ variables are the rows of $\mathbf{A}_2$ (columns of $\mathbf{A}_2'$). The coordinates are discussed
further in Section 15.3.4.

The adequacy of the fit in (15.53) can be evaluated by examining the first two
eigenvalues $\lambda_1$ and $\lambda_2$ of $\mathbf{S}$. Thus a large value (close to 1) of

$$\frac{\lambda_1 + \lambda_2}{\sum_{i=1}^{p}\lambda_i}$$

would indicate that $\mathbf{Y}_c$ is represented well visually in the plot.
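The steps in (15.51) through (15.53) can be sketched numerically. The following NumPy example is an illustration under stated assumptions, not the author's code: the 6 x 3 data matrix `Y` is hypothetical (it is not taken from the text), and `np.linalg.eigh` is used to obtain the eigenvectors of S.

```python
import numpy as np

# Hypothetical data matrix: n = 6 observations on p = 3 variables.
Y = np.array([[2.0, 4.0, 1.0],
              [3.0, 7.0, 2.0],
              [5.0, 6.0, 4.0],
              [6.0, 9.0, 3.0],
              [8.0, 8.0, 7.0],
              [9.0, 12.0, 6.0]])

n, p = Y.shape
Yc = Y - Y.mean(axis=0)                 # centered data matrix Y_c
S = Yc.T @ Yc / (n - 1)                 # sample covariance matrix

eigvals, A = np.linalg.eigh(S)          # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]       # reorder so the largest comes first
eigvals, A = eigvals[order], A[:, order]

Z = Yc @ A                              # principal components, as in (15.51)
Z2, A2 = Z[:, :2], A[:, :2]             # biplot coordinates: rows of Z2 and A2
fit = Z2 @ A2.T                         # two-dimensional fit, as in (15.53)

explained = eigvals[:2].sum() / eigvals.sum()
print(np.round(Z2, 3))                  # observation points
print(np.round(A2, 3))                  # variable points (arrow tips)
print(round(float(explained), 4))       # adequacy of the two-dimensional fit
```

The rows of `Z2` would be plotted as points and the rows of `A2` as arrows from the origin. The residual sum of squares of the fit equals (n - 1) times the discarded eigenvalue, which is why the eigenvalue ratio measures the adequacy of the plot.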

15.3.3  Singular  Value  Decomposition  Plots 

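We can also obtain $\mathbf{Y}_c = \mathbf{Z}\mathbf{A}'$ in (15.52) by means of the singular value decomposition
of $\mathbf{Y}_c$. By (2.117), we have

$$\mathbf{Y}_c = \mathbf{U}\boldsymbol{\Lambda}\mathbf{V}', \tag{15.54}$$

where $\boldsymbol{\Lambda} = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p)$ is a diagonal matrix containing square roots of the
(nonzero) eigenvalues $\lambda_1^2, \lambda_2^2, \ldots, \lambda_p^2$ of $\mathbf{Y}_c'\mathbf{Y}_c$ (and of $\mathbf{Y}_c\mathbf{Y}_c'$), the columns of $\mathbf{U}$ are
the corresponding eigenvectors of $\mathbf{Y}_c\mathbf{Y}_c'$, and the columns of $\mathbf{V}$ are the corresponding
eigenvectors of $\mathbf{Y}_c'\mathbf{Y}_c$.

The product $\mathbf{U}\boldsymbol{\Lambda}$ in (15.54) is equal to $\mathbf{Z}$, the matrix of principal component
scores in (15.51). To see this we multiply (15.54) by $\mathbf{V}$, which is orthogonal because
it contains the (normalized) eigenvectors of the symmetric matrix $\mathbf{Y}_c'\mathbf{Y}_c$ (see Section
2.11.6). This gives

$$\mathbf{Y}_c\mathbf{V} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{V}'\mathbf{V} = \mathbf{U}\boldsymbol{\Lambda}. \tag{15.55}$$

By (10.17), $\mathbf{Y}_c'\mathbf{Y}_c$ is equal to $(n - 1)\mathbf{S}$. By (2.106), eigenvectors of $(n - 1)\mathbf{S}$ are also
eigenvectors of $\mathbf{S}$. Thus $\mathbf{V}$ is the same as $\mathbf{A}$ in (15.51), which contains eigenvectors
of $\mathbf{S}$. Hence, $\mathbf{Y}_c\mathbf{V}$ in (15.55) becomes

$$\mathbf{Y}_c\mathbf{V} = \mathbf{Y}_c\mathbf{A} = \mathbf{Z} \quad [\text{by (15.51)}]$$
$$\phantom{\mathbf{Y}_c\mathbf{V}} = \mathbf{U}\boldsymbol{\Lambda} \quad [\text{by (15.55)}].$$

We can therefore write (15.54) as

$$\mathbf{Y}_c = \mathbf{U}\boldsymbol{\Lambda}\mathbf{V}' = \mathbf{Z}\mathbf{V}' = \mathbf{Z}\mathbf{A}'. \tag{15.56}$$

Thus the singular value decomposition of $\mathbf{Y}_c$ gives the same factoring as the expression
in (15.52) based on principal components.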
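The equivalence in (15.56) can be checked numerically. This is a minimal NumPy sketch on a hypothetical 6 x 3 data matrix (an assumption, not data from the text). Because eigenvectors are determined only up to sign, the sketch flips the signs of the columns of A to match V before comparing.

```python
import numpy as np

# Hypothetical data matrix: n = 6 observations on p = 3 variables.
Y = np.array([[2.0, 4.0, 1.0],
              [3.0, 7.0, 2.0],
              [5.0, 6.0, 4.0],
              [6.0, 9.0, 3.0],
              [8.0, 8.0, 7.0],
              [9.0, 12.0, 6.0]])
n = Y.shape[0]
Yc = Y - Y.mean(axis=0)

# Singular value decomposition Y_c = U diag(lam) V', as in (15.54).
U, lam, Vt = np.linalg.svd(Yc, full_matrices=False)
V = Vt.T

# Eigenvectors of S, ordered by decreasing eigenvalue.
S = Yc.T @ Yc / (n - 1)
eigvals, A = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, A = eigvals[order], A[:, order]

# Align the arbitrary signs of the eigenvectors with the columns of V.
signs = np.sign(np.sum(A * V, axis=0))
A = A * signs

Z = Yc @ A                               # principal components (15.51)
print(np.allclose(U * lam, Z))           # U Lambda equals Z
print(np.allclose(V, A))                 # V equals A
print(np.allclose(lam ** 2, (n - 1) * eigvals))
```

The last line reflects the relation between the two sets of eigenvalues: the squared singular values of $\mathbf{Y}_c$ are $(n - 1)$ times the eigenvalues of $\mathbf{S}$.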


15.3.4  Coordinates 

In  this  section,  we  consider  the  coordinates  for  the  methods  of  Sections  15.3.2  and 
15.3.3.  Let  us  return  to  (15.53),  the  two-dimensional  representation  of  Yc  based  on 
principal  components  (which  is  the  same  representation  as  that  based  on  the  singular 
value  decomposition): 


$$\mathbf{Y}_c \cong \mathbf{Z}_2\mathbf{A}_2' =
\begin{pmatrix} z_{11} & z_{12} \\ z_{21} & z_{22} \\ \vdots & \vdots \\ z_{n1} & z_{n2} \end{pmatrix}
\begin{pmatrix} a_{11} & a_{21} & \cdots & a_{p1} \\ a_{12} & a_{22} & \cdots & a_{p2} \end{pmatrix}. \tag{15.57}$$

The elements of (15.57) are of the form

$$y_{ij} - \bar{y}_j \cong z_{i1}a_{j1} + z_{i2}a_{j2}, \quad i = 1, 2, \ldots, n; \; j = 1, 2, \ldots, p.$$

Thus each observation is represented as a linear combination, the coordinates (coefficients)
being the elements of the vector $(z_{i1}, z_{i2})$ and the axes being the elements
of the vector $(a_{j1}, a_{j2})$. We therefore plot the points $(z_{i1}, z_{i2})$, $i = 1, 2, \ldots, n$, and
the points $(a_{j1}, a_{j2})$, $j = 1, 2, \ldots, p$. To distinguish them and to show the relationship
of the points to the axes, the points $(a_{j1}, a_{j2})$ are connected to the origin with
a straight line forming an arrow. If necessary, the scale of the points $(a_{j1}, a_{j2})$ could
be adjusted to be compatible with that of the principal components $(z_{i1}, z_{i2})$.

The Euclidean distance between two points $(z_{i1}, z_{i2})$ and $(z_{k1}, z_{k2})$ is approximately
equal to the distance between the corresponding points (rows) $\mathbf{y}_i'$ and $\mathbf{y}_k'$ in
the data matrix $\mathbf{Y}$. If all of the principal components were used, as in (15.51) and
(15.52), the distance would be the same, but with only two principal components,
the distance is an approximation.
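The distance property can be illustrated with a short NumPy sketch. The data matrix is hypothetical (not from the text), and `pairwise_sq_dists` is an illustrative helper name. With all p components the squared distances between rows of Z match those between rows of Y; with two components they can only be smaller or equal.

```python
import numpy as np

# Hypothetical data matrix: n = 6 observations on p = 3 variables.
Y = np.array([[2.0, 4.0, 1.0],
              [3.0, 7.0, 2.0],
              [5.0, 6.0, 4.0],
              [6.0, 9.0, 3.0],
              [8.0, 8.0, 7.0],
              [9.0, 12.0, 6.0]])
n = Y.shape[0]
Yc = Y - Y.mean(axis=0)

eigvals, A = np.linalg.eigh(Yc.T @ Yc / (n - 1))
A = A[:, np.argsort(eigvals)[::-1]]      # columns ordered by decreasing eigenvalue
Z = Yc @ A                               # all p principal components


def pairwise_sq_dists(X):
    """Matrix of squared Euclidean distances between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return (diff ** 2).sum(axis=2)


D_Y = pairwise_sq_dists(Y)               # distances in the data matrix
D_full = pairwise_sq_dists(Z)            # using all components: identical
D_two = pairwise_sq_dists(Z[:, :2])      # using two components: approximate

print(np.allclose(D_full, D_Y))
print(np.round(D_Y - D_two, 3))          # nonnegative approximation errors
```

The error matrix printed last shows how much of each squared distance is lost by dropping the remaining components.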

The cosine of the angle between the arrows (lines) drawn to each pair of axis
points $(a_{j1}, a_{j2})$ and $(a_{k1}, a_{k2})$ shows the correlation between the two corresponding
variables [see (3.14) and (3.15)]. Thus a small angle between two vectors indicates
that the two variables are highly correlated, two variables whose vectors form a 90°
angle are uncorrelated, and an angle greater than 90° indicates that the variables are
negatively correlated.

The values of the $p$ variables in the $i$th observation vector $\mathbf{y}_i$ (corrected for means)
are related to the perpendicular projection of the point $(z_{i1}, z_{i2})$ on the $p$ vectors from
the origin to the points $(a_{j1}, a_{j2})$ representing variables. The further from the origin
a projection falls on an arrow, the larger the value of the observation on that variable.
Hence the vectors will be oriented toward the observations that have larger values on
the corresponding variables.


Example 15.3.4. Using the city crime data of Table 14.1, we illustrate the principal
component approach. The first two eigenvectors of the sample covariance matrix S
are given by

$$\mathbf{A}_2 = \begin{pmatrix}
.002 & .008 \\
.017 & .014 \\
.182 & .689 \\
.104 & .221 \\
.747 & -.240 \\
.612 & -.109 \\
.153 & .638
\end{pmatrix}.$$

The matrix of the first two principal components is given by
Figure 15.12. Principal components biplot for city crime data in Table 14.1.


The coordinates for the 16 cities are found in $\mathbf{Z}_2$, and the coordinates for the 7 variables
are found in $\mathbf{A}_2$. The plot of the city and variable points is given in Figure 15.12.
The observation points are spread out, whereas the variable points are clustered
tightly around the origin. Suitable scaling of the eigenvectors in $\mathbf{A}_2$ would enable the
arrows representing the variables to pass through the points (see Example 15.3.5). □


15.3.5  Other  Methods 

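The singular value decomposition of $\mathbf{Y}_c$ is given in (15.54) as

$$\mathbf{Y}_c = \mathbf{U}\boldsymbol{\Lambda}\mathbf{V}'. \tag{15.58}$$

In Section 15.3.3, it was shown that $\mathbf{U}\boldsymbol{\Lambda} = \mathbf{Z}$ and $\mathbf{V} = \mathbf{A}$ [see (15.56)], so that
(15.58) can be written as

$$\mathbf{Y}_c = (\mathbf{U}\boldsymbol{\Lambda})\mathbf{V}' = \mathbf{Z}\mathbf{A}',$$

which is equivalent to the principal component solution $\mathbf{Y}_c = \mathbf{Z}\mathbf{A}'$ in (15.52). Alternative
factorings may be of interest. Two that have been considered are

$$\mathbf{Y}_c = \mathbf{U}(\boldsymbol{\Lambda}\mathbf{V}'), \tag{15.59}$$

$$\mathbf{Y}_c = (\mathbf{U}\boldsymbol{\Lambda}^{1/2})(\boldsymbol{\Lambda}^{1/2}\mathbf{V}'). \tag{15.60}$$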


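If we denote the submatrices consisting of the first two columns of $\mathbf{U}$ and $\mathbf{V}$ as
$\mathbf{U}_2$ and $\mathbf{V}_2$, respectively, and define $\boldsymbol{\Lambda}_2 = \mathrm{diag}(\lambda_1, \lambda_2)$, then the two-dimensional
representations of (15.59) and (15.60) are

$$\mathbf{Y}_c \cong \mathbf{U}_2(\boldsymbol{\Lambda}_2\mathbf{V}_2') =
\begin{pmatrix} u_{11} & u_{12} \\ u_{21} & u_{22} \\ \vdots & \vdots \\ u_{n1} & u_{n2} \end{pmatrix}
\begin{pmatrix} \lambda_1 v_{11} & \lambda_1 v_{21} & \cdots & \lambda_1 v_{p1} \\
\lambda_2 v_{12} & \lambda_2 v_{22} & \cdots & \lambda_2 v_{p2} \end{pmatrix}, \tag{15.61}$$

$$\mathbf{Y}_c \cong (\mathbf{U}_2\boldsymbol{\Lambda}_2^{1/2})(\boldsymbol{\Lambda}_2^{1/2}\mathbf{V}_2') =
\begin{pmatrix} \sqrt{\lambda_1}\,u_{11} & \sqrt{\lambda_2}\,u_{12} \\ \sqrt{\lambda_1}\,u_{21} & \sqrt{\lambda_2}\,u_{22} \\ \vdots & \vdots \\ \sqrt{\lambda_1}\,u_{n1} & \sqrt{\lambda_2}\,u_{n2} \end{pmatrix}
\begin{pmatrix} \sqrt{\lambda_1}\,v_{11} & \sqrt{\lambda_1}\,v_{21} & \cdots & \sqrt{\lambda_1}\,v_{p1} \\
\sqrt{\lambda_2}\,v_{12} & \sqrt{\lambda_2}\,v_{22} & \cdots & \sqrt{\lambda_2}\,v_{p2} \end{pmatrix}. \tag{15.62}$$

For the biplot corresponding to (15.61), we plot the set of points $(u_{i1}, u_{i2})$, $i = 1,
2, \ldots, n$, and the set of points $(\lambda_1 v_{j1}, \lambda_2 v_{j2})$, $j = 1, 2, \ldots, p$, with the latter points
connected to the origin by an arrow to show the axes. For the biplot arising from
(15.62), we plot the set of points $(\sqrt{\lambda_1}\,u_{i1}, \sqrt{\lambda_2}\,u_{i2})$, $i = 1, 2, \ldots, n$, and the set of
points $(\sqrt{\lambda_1}\,v_{j1}, \sqrt{\lambda_2}\,v_{j2})$, $j = 1, 2, \ldots, p$, with the latter points connected to the
origin with an arrow.

The presence of $\lambda_1$ and $\lambda_2$ in (15.61) and (15.62) provides scaling that is absent
in (15.57). For many data sets the scaling in (15.62) will be adequate with no further
adjustment.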

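If we write (15.59) in the form

$$\mathbf{Y}_c = \mathbf{U}(\boldsymbol{\Lambda}\mathbf{V}') = \mathbf{U}(\mathbf{V}\boldsymbol{\Lambda})' = \mathbf{U}\mathbf{H}', \tag{15.63}$$

then

$$\mathbf{U}\mathbf{U}' = \mathbf{Y}_c\mathbf{S}^{-1}\mathbf{Y}_c'/(n - 1), \tag{15.64}$$

$$\mathbf{H}\mathbf{H}' = (n - 1)\mathbf{S} \tag{15.65}$$

(see Problem 15.10). With suitable scaling of the eigenvectors in $\mathbf{U}$ and $\mathbf{V}$, we could
eliminate the coefficients involving $n - 1$ from (15.64) and (15.65).

By (15.64) (with scaling to eliminate $n - 1$), the (Euclidean) distance $(\mathbf{u}_i -
\mathbf{u}_k)'(\mathbf{u}_i - \mathbf{u}_k)$ between two points $\mathbf{u}_i$ and $\mathbf{u}_k$ is equal to the Mahalanobis distance
$(\mathbf{y}_i - \mathbf{y}_k)'\mathbf{S}^{-1}(\mathbf{y}_i - \mathbf{y}_k)$ between the corresponding points $\mathbf{y}_i$ and $\mathbf{y}_k$ in the data matrix
$\mathbf{Y}$:

$$(\mathbf{u}_i - \mathbf{u}_k)'(\mathbf{u}_i - \mathbf{u}_k) = (\mathbf{y}_i - \mathbf{y}_k)'\mathbf{S}^{-1}(\mathbf{y}_i - \mathbf{y}_k) \tag{15.66}$$

(see Problem 15.11). By (15.65), the covariance $s_{jk}$ between the $j$th and $k$th variables
(columns of $\mathbf{Y}$) is given by

$$s_{jk} = \mathbf{h}_j'\mathbf{h}_k, \tag{15.67}$$

where $\mathbf{h}_j'$ and $\mathbf{h}_k'$ are rows of $\mathbf{H}$. By (3.14) and (3.15), this can be converted to the
correlation

$$r_{jk} = \cos\theta = \frac{\mathbf{h}_j'\mathbf{h}_k}{\sqrt{(\mathbf{h}_j'\mathbf{h}_j)(\mathbf{h}_k'\mathbf{h}_k)}}, \tag{15.68}$$

so that the angle between the two vectors $\mathbf{h}_j$ and $\mathbf{h}_k$ is related to the correlation
between the $j$th and $k$th variables.

The two-dimensional representation of $\mathbf{u}_i$ and $\mathbf{h}_j$ in (15.61) has the approximate
Mahalanobis distance and correlation properties discussed earlier.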
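The properties in (15.64) through (15.68) can be verified numerically with the factor n - 1 kept explicit. The following NumPy sketch uses a hypothetical 6 x 3 data matrix (not from the text); the names `G1`, `H1`, `G2`, and `H2` for the two sets of biplot coordinates are illustrative assumptions.

```python
import numpy as np

# Hypothetical data matrix: n = 6 observations on p = 3 variables.
Y = np.array([[2.0, 4.0, 1.0],
              [3.0, 7.0, 2.0],
              [5.0, 6.0, 4.0],
              [6.0, 9.0, 3.0],
              [8.0, 8.0, 7.0],
              [9.0, 12.0, 6.0]])
n = Y.shape[0]
Yc = Y - Y.mean(axis=0)
S = Yc.T @ Yc / (n - 1)
S_inv = np.linalg.inv(S)

U, lam, Vt = np.linalg.svd(Yc, full_matrices=False)
V = Vt.T
H = V * lam                               # H = V Lambda, as in (15.63)

# (15.65): H H' = (n - 1) S, and (15.64): U U' = Y_c S^{-1} Y_c' / (n - 1).
ok_HH = np.allclose(H @ H.T, (n - 1) * S)
ok_UU = np.allclose(U @ U.T, Yc @ S_inv @ Yc.T / (n - 1))

# (15.66) with the factor n - 1 kept: distance between rows of U versus
# the Mahalanobis distance between the corresponding rows of Y.
i, k = 0, 4
d_u = (n - 1) * np.sum((U[i] - U[k]) ** 2)
d_mahal = (Y[i] - Y[k]) @ S_inv @ (Y[i] - Y[k])

# (15.68): cosines of angles between rows of H equal the correlations.
norms = np.sqrt(np.sum(H ** 2, axis=1))
cosines = (H @ H.T) / np.outer(norms, norms)
R = np.corrcoef(Y, rowvar=False)

# Two-dimensional coordinates for the factorings (15.61) and (15.62).
G1, H1 = U[:, :2], V[:, :2] * lam[:2]
G2, H2 = U[:, :2] * np.sqrt(lam[:2]), V[:, :2] * np.sqrt(lam[:2])

print(ok_HH, ok_UU, np.isclose(d_u, d_mahal), np.allclose(cosines, R))
```

Both factorings reproduce the same rank-two fit (`G1 @ H1.T` equals `G2 @ H2.T`); they differ only in how the singular values are shared between the observation points and the variable points.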


Example 15.3.5. Using the city crime data of Table 14.1, we illustrate the singular
value decomposition method with the factorings in (15.61) and (15.62). The matrices
$\mathbf{U}_2$, $\boldsymbol{\Lambda}_2$, and $\mathbf{V}_2$ are

$$\boldsymbol{\Lambda}_2 = \mathrm{diag}(1503.604, 678.615).$$

By (15.61), the two-dimensional representation is given by plotting the rows of $\mathbf{U}_2$
and the rows of $\mathbf{V}_2\boldsymbol{\Lambda}_2$ (or the columns of $\boldsymbol{\Lambda}_2\mathbf{V}_2'$). For $\mathbf{V}_2\boldsymbol{\Lambda}_2$ we have

The plot of the observation points and variable points is given in Figure 15.13.
For (15.62), we obtain

The plot of these coordinates is given in Figure 15.14. For this data set, the factoring
given by (15.62) in Figure 15.14 is preferred because it plots both observation and
variable points on the same scale. The factorings shown in Figures 15.12 and 15.13
would need an adjustment in scaling. □


PROBLEMS 

15.1 In step 2 of the algorithm for metric scaling in Section 15.1.2, the matrix $\mathbf{B} =
(b_{ij})$ is defined in terms of $\mathbf{A} = (a_{ij})$ as $b_{ij} = a_{ij} - \bar{a}_{i.} - \bar{a}_{.j} + \bar{a}_{..}$. Show that $b_{ij}$
in $\mathbf{B} = (\mathbf{I} - \frac{1}{n}\mathbf{J})\mathbf{A}(\mathbf{I} - \frac{1}{n}\mathbf{J})$ in (15.2) is equivalent to $b_{ij} = a_{ij} - \bar{a}_{i.} - \bar{a}_{.j} + \bar{a}_{..}$.

15.2 Verify the result stated in step 2 of the algorithm in Section 15.1.2, namely,
that there exists a $q$-dimensional configuration $\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_n$ such that $d_{ij} =
\delta_{ij}$ if and only if $\mathbf{B}$ is positive semidefinite of rank $q$. Use the following
approach.

(a) Assuming the existence of $\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_n$ such that $\delta_{ij}^2 = d_{ij}^2 = (\mathbf{z}_i -
\mathbf{z}_j)'(\mathbf{z}_i - \mathbf{z}_j)$, show that $\mathbf{B}$ is positive semidefinite.

(b) Assuming $\mathbf{B}$ is positive semidefinite, show that there exist $\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_n$
such that $d_{ij}^2 = (\mathbf{z}_i - \mathbf{z}_j)'(\mathbf{z}_i - \mathbf{z}_j) = \delta_{ij}^2$.

15.3 (a) Show that $\mathbf{r} = \sum_{j=1}^{b} p_{.j}\mathbf{c}_j$ in (15.20) is the same as $\mathbf{r} = (p_{1.}, p_{2.}, \ldots,
p_{a.})'$ in (15.9).

(b) Show that $\mathbf{c}' = \sum_{i=1}^{a} p_{i.}\mathbf{r}_i'$ in (15.21) is equivalent to $\mathbf{c}' = (p_{.1}, p_{.2},
\ldots, p_{.b})$ in (15.10).

15.4 Show that $\mathbf{j}'\mathbf{r} = \mathbf{c}'\mathbf{j} = 1$ as in (15.22).

15.5 Show that the chi-square statistic in (15.26) is equal to that in (15.25).

15.6 (a) Show that the chi-square statistic in (15.27) is equal to that in (15.25).

(b) Show that the chi-square statistic in (15.28) is equal to that in (15.25).

15.7 (a) Show the chi-square statistic in (15.29) is equal to that in (15.27).

(b) Show the chi-square statistic in (15.30) is equal to that in (15.28).

15.8 (a) Show that $\mathbf{R} - \mathbf{j}\mathbf{c}' = \mathbf{D}_r^{-1}(\mathbf{P} - \mathbf{r}\mathbf{c}')$ as in (15.41).

(b) Show that $\mathbf{C} - \mathbf{r}\mathbf{j}' = (\mathbf{P} - \mathbf{r}\mathbf{c}')\mathbf{D}_c^{-1}$ as in (15.42).

15.9 Show that if all the principal components were used, the distance between $\mathbf{z}_i$
and $\mathbf{z}_k$ would be the same as between $\mathbf{y}_i$ and $\mathbf{y}_k$, as noted in Section 15.3.4.

15.10 (a) Show that $\mathbf{U}\mathbf{U}' = \mathbf{Y}_c\mathbf{S}^{-1}\mathbf{Y}_c'/(n - 1)$ as in (15.64).

(b) Show that $\mathbf{H}\mathbf{H}' = (n - 1)\mathbf{S}$ as in (15.65).

15.11 Show that $(\mathbf{u}_i - \mathbf{u}_k)'(\mathbf{u}_i - \mathbf{u}_k) = (\mathbf{y}_i - \mathbf{y}_k)'\mathbf{S}^{-1}(\mathbf{y}_i - \mathbf{y}_k)$ as in (15.66).

15.12  In  Table  15.15,  we  have  road  distances  between  major  UK  towns  (Hand  et  al. 
1994,  p.  346).  The  towns  are  as  follows: 

A = Aberdeen, B = Birmingham, C = Brighton, D = Bristol, E = Cardiff,
F = Carlisle, G = Dover, H = Edinburgh, I = Fort William, J = Glasgow,
K = Holyhead, L = Hull, M = Inverness, N = Leeds, O = Liverpool,
P = London, Q = Manchester, R = Newcastle, S = Norwich, T = Nottingham,
U = Penzance, V = Plymouth, W = Sheffield.

(a)  Find  the  matrix  B  as  in  (15.2). 

(b)  Using  the  spectral  decomposition,  find  the  first  two  columns  of  the  matrix 
Z  as  in  (15.4). 

(c)  Create  a  metric  multidimensional  scaling  plot  of  the  first  two  dimensions. 
What  do  you  notice  about  the  positions  of  the  cities? 

15.13  Zhang,  Helander,  and  Drury  (1996)  analyzed  a  43  x  43  similarity  matrix  for  43 
descriptors  of  comfort,  such  as  calm,  tingling,  restful,  etc.  For  the  similarity 
matrix,  see  the  Wiley  ftp  site  (Appendix  C). 

(a)  Carry  out  a  metric  multidimensional  scaling  analysis  and  plot  the  first  two 
dimensions.  What  pattern  is  seen  in  the  plot? 


Table 15.15. Road distances between the UK towns A through W listed in Problem 15.12.
[Table body not recoverable from the source.]


Table 15.16. Dissimilarity Matrix for World War II Politicians

Person        Hitler   Mussolini   Churchill   Eisenhower

Hitler           0         5          11           15
Mussolini        5         0          14           16
Churchill       11        14           0            7
Eisenhower      15        16           7            0
Stalin           8        13          11           16
Attlee          17        18          11           16
Franco           5         3          12           14
De Gaulle       10        11           5            8
Mao Tse         16        18          16           17
Truman          17        18           8            6
Chamberlain     12        14          10            7
Tito            16        17           8           12

              Stalin   Attlee      Franco      De Gaulle

Hitler           8        17           5           10
Mussolini       13        18           3           11
Churchill       11        11          12            5
Eisenhower      16        16          14            8
Stalin           0        15          13           11
Attlee          15         0          16           12
Franco          13        16           0            9
De Gaulle       11        12           9            0
Mao Tse         12        16          17           13
Truman          14        12          16            9
Chamberlain     16         9          10           11
Tito            12        13          12            7

              Mao Tse  Truman      Chamberlain Tito

Hitler          16        17          12           16
Mussolini       18        18          14           17
Churchill       16         8          10            8
Eisenhower      17         6           7           12
Stalin          12        14          16           12
Attlee          16        12           9           13
Franco          17        16          10           12
De Gaulle       13         9          11            7
Mao Tse          0        12          17           10
Truman          12         0           9           11
Chamberlain     17         9           0           15
Tito            10        11          15            0
(b) For an alternative approach, carry out a cluster analysis of the configuration
of points found in part (a), using Ward’s method. Create a dendrogram
of the cluster solution. How many clusters are indicated?

15.14 Use the politics data of Table 15.16 (Everitt 1987, Table 6.7). Two subjects
assessed the degree of dissimilarity between World War II politicians. The
data matrix represents the sum of the dissimilarities between the two subjects.

(a) For k = 6, create an initial configuration of points by choosing 12 random
observations taken from a multivariate normal distribution with mean
vector 0 and covariance matrix $\mathbf{I}_6$.

(b) Carry out a nonmetric multidimensional scaling analysis using the seeds
found in part (a). Find the value of the STRESS statistic.

(c) Repeat parts (a) and (b) for k = 1, ..., 5. Plot the STRESS values against
the values of k. How many dimensions should be kept? Plot the final
configuration of points with two dimensions.

(d) Repeat parts (a)-(c) using an initial configuration of points from a multivariate
normal with different mean vector and covariance matrix from
those in part (a). How many dimensions should be kept? Plot the final
configuration of points with two dimensions. How does this solution compare
to that in part (c)?

(e) Repeat parts (a)-(c) using an initial configuration of points from a uniform
distribution over (0, 1). How many dimensions should be kept? Plot
the final configuration of points with two dimensions.

(f) Repeat part (e) using as the initial configuration of points the metric multidimensional
scaling solution found by treating the ordinal measurements
as continuous. How many dimensions should be kept? Plot the final configuration
of points with two dimensions.

Table  15.17.  Birth  and  Death  Months  of  1281  People 


Publisher's  Note: 
15.15 In Table 15.17 we have the months of birth and death for 1281 people
(Andrews and Herzberg 1985, Table 71.2).

(a) Find the correspondence matrix P as in (15.8).

(b)  Find  the  matrices  R  and  C,  as  in  (15.15)  and  (15.19). 

(c)  Perform  a  chi-square  test  for  independence  between  birth  and  death 
months. 

(d)  Plot  the  row  and  column  deviations  as  in  Example  15.2.5(a). 

15.16  In  Table  15.18,  we  have  a  cross-classification  of  crimes  in  Norway  in  1984 
categorized  by  type  and  site  (Clausen  1998,  p.  9). 


Table 15.18. Crimes by Type and Site

Part of Country   Burglary   Fraud   Vandalism   Total

Oslo area            395      2456      1758      4609
Mid Norway           147       153       916      1216
North Norway         694       327      1347      2368

Total               1236      2936      4021      8193

(a) Find the correspondence matrix P as in (15.8).

(b)  Find  the  matrices  R  and  C  as  in  (15.15)  and  (15.19). 

(c)  Perform  a  chi-square  test  for  independence  between  type  of  crime  and 
site. 

(d)  Plot  the  row  and  column  deviations  as  in  Example  15.2.4. 

15.17 In Table 15.19, we have a six-way contingency table (Andrews and Herzberg
1985, Table 34.1). Carry out a multiple correspondence analysis.

(a)  Set  up  an  indicator  matrix  G  and  find  the  Burt  matrix  G'G. 

(b)  Perform  a  correspondence  analysis  on  the  Burt  matrix  found  in  part  (a) 
and  plot  the  coordinates. 

(c)  What  associations  are  present? 

15.18  Use  the  protein  consumption  data  of  Table  14.7. 

(a)  Create  a  biplot  using  the  principal  component  approach  in  (15.53)  or 
(15.57). 

(b)  Create  a  biplot  using  the  singular  value  decomposition  approach  with  the 
factoring  as  in  (15.61). 

(c)  Create  a  biplot  using  the  singular  value  decomposition  approach  with  the 
factoring  as  in  (15.62). 

(d)  Which  of  the  three  biplots  best  represents  the  data? 

15.19  Use  the  perception  data  of  Table  13.1. 
Table 15.19. Byssinosis Data

(a) Create a biplot using the principal component approach in (15.53) or
(15.57). 

(b)  Create  a  biplot  using  the  singular  value  decomposition  approach  with  the 
factoring  as  in  (15.61). 

(c)  Create  a  biplot  using  the  singular  value  decomposition  approach  with  the 
factoring  as  in  (15.62). 

(d)  Which  of  the  three  biplots  best  represents  the  data? 

15.20  Use  the  cork  data  of  Table  6.21. 

(a)  Create  a  biplot  using  the  principal  component  approach  in  (15.53)  or 
(15.57). 

(b)  Create  a  biplot  using  the  singular  value  decomposition  approach  with  the 
factoring  as  in  (15.61). 

(c)  Create  a  biplot  using  the  singular  value  decomposition  approach  with  the 
factoring  as  in  (15.62). 

(d)  Which  of  the  three  biplots  best  represents  the  data?

Table A.1. Upper Percentiles for $\sqrt{b_1}$

$$\sqrt{b_1} = \frac{\sqrt{n}\,\sum_{i=1}^{n}(y_i-\bar{y})^3}{\left[\sum_{i=1}^{n}(y_i-\bar{y})^2\right]^{3/2}}$$

The sampling distribution of $\sqrt{b_1}$ is symmetric about zero, and lower percentage points
corresponding to negative skewness are given by the negative of the table values. Reject the
hypothesis of normality if $\sqrt{b_1}$ is greater than the table value or less than the negative of the
table value.
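
As a minimal sketch, the skewness statistic for Table A.1 can be computed directly from its definition (square root of n times the sum of cubed deviations, divided by the sum of squared deviations raised to the 3/2 power). The function name `sqrt_b1` and the two small samples are illustrative only.

```python
import math


def sqrt_b1(y):
    """Sample skewness sqrt(b1): sqrt(n) * sum(d^3) / (sum(d^2))^(3/2)."""
    n = len(y)
    ybar = sum(y) / n
    ss2 = sum((v - ybar) ** 2 for v in y)   # sum of squared deviations
    ss3 = sum((v - ybar) ** 3 for v in y)   # sum of cubed deviations
    return math.sqrt(n) * ss3 / ss2 ** 1.5


print(sqrt_b1([1, 2, 3, 4, 5]))   # symmetric sample: cubed deviations cancel
print(sqrt_b1([0, 0, 0, 1]))      # right-skewed sample of size 4
```

The second sample gives about 1.155, which is larger than every entry in the n = 4 row of the table.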

                  Upper Percentiles
  n      10       5     2.5       1      .5      .1
  4    .831    .987   1.070   1.120   1.137   1.151
  5    .821   1.049   1.207   1.337   1.396   1.464
  6    .795   1.042   1.239   1.429   1.531   1.671
  7    .782   1.018   1.230   1.457   1.589   1.797
  8    .765    .998   1.208   1.452   1.605   1.866
  9    .746    .977   1.184   1.433   1.598   1.898
 10    .728    .954   1.159   1.407   1.578   1.906
 11    .710    .931   1.134   1.381   1.553   1.899
 12    .693    .910   1.109   1.353   1.526   1.882
 13    .677    .890   1.085   1.325   1.497   1.859
 14    .662    .870   1.061   1.298   1.468   1.832
 15    .648    .851   1.039   1.272   1.440   1.803
 16    .635    .834   1.018   1.247   1.412   1.773
 17    .622    .817    .997   1.222   1.385   1.744
 18    .610    .801    .978   1.199   1.359   1.714
 19    .599    .786    .960   1.176   1.334   1.685
 20    .588    .772    .942   1.155   1.310   1.657
 21    .578    .758    .925   1.134   1.287   1.628
 22    .568    .746    .909   1.114   1.265   1.602
 23    .559    .733    .894   1.096   1.243   1.575
 24    .550    .722    .880   1.078   1.223   1.550
 25    .542    .710    .866   1.060   1.203   1.526

Table A.2. Coefficients for Transforming $\sqrt{b_1}$ to a Standard Normal

   n   $\delta$   $1/\lambda$
   8     5.563      .3030
   9     4.260      .4080
  10     3.734      .4794
  11     3.447      .5339
  12     3.270      .5781
  13     3.151      .6153
  14     3.069      .6473
  15     3.010      .6753
  16     2.968      .7001
  17     2.937      .7224
  18     2.915      .7426
  19     2.900      .7610
  20     2.890      .7779
  21     2.884      .7934
  22     2.882      .8078
  23     2.882      .8211
  24     2.884      .8336
  25     2.889      .8452
  26     2.895      .8561
  27     2.902      .8664
  28     2.910      .8760
  29     2.920      .8851
  30     2.930      .8938
  31     2.941      .9020
  32     2.952      .9097
  33     2.964      .9171
  34     2.977      .9241
  35     2.990      .9308
  36     3.003      .9372
  37     3.016      .9433
  38     3.030      .9492
  39     3.044      .9548
  40     3.058      .9601
  41     3.073      .9653
  42     3.087      .9702
  43     3.102      .9750
  44     3.117      .9795
  45     3.131      .9840
  46     3.146      .9882
  47     3.161      .9923
  48     3.176      .9963
  49     3.192     1.0001
  50     3.207     1.0038
  52     3.237     1.0108
  54     3.268     1.0174
  56     3.298     1.0235
  58     3.329     1.0293
  60     3.359     1.0348
  62     3.389     1.0400
  64     3.420     1.0449
  66     3.450     1.0495
  68     3.480     1.0540
  70     3.510     1.0581
  72     3.540     1.0621
  74     3.569     1.0659
  76     3.599     1.0695
  78     3.628     1.0730
  80     3.657     1.0763
  82     3.686     1.0795
  84     3.715     1.0825
  86     3.744     1.0854
  88     3.772     1.0882
  90     3.801     1.0909
  92     3.829     1.0934
  94     3.857     1.0959
  96     3.885     1.0983
  98     3.913     1.1006
 100     3.940     1.1028
 105     4.009     1.1080
 110     4.076     1.1128
 115     4.142     1.1172
 120     4.207     1.1212
 125     4.272     1.1250
 130     4.336     1.1285
 135     4.398     1.1318
 140     4.460     1.1348
 145     4.521     1.1377
 150     4.582     1.1403
 155     4.641     1.1428
 160     4.700     1.1452
 165     4.758     1.1474
 170     4.816     1.1496
 175     4.873     1.1516
 180     4.929     1.1535
 185     4.985     1.1553
 190     5.040     1.1570
 195     5.094     1.1586
 200     5.148     1.1602
 205     5.202     1.1616
 210     5.255     1.1631
 215     5.307     1.1644
 220     5.359     1.1657
 225     5.410     1.1669
 230     [illegible]
 235     [illegible]
 240     5.561     1.1704
 245     5.611     1.1714
 250     5.660     1.1724

Values of $\delta$ and $1/\lambda$ are such that $g(\sqrt{b_1}) = \delta \sinh^{-1}(\sqrt{b_1}/\lambda)$ is approximately N(0, 1).
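
A short sketch of how the tabled coefficients would be used, assuming the two columns are the coefficients δ and 1/λ of the inverse hyperbolic sine transformation described in the footnote. The function name and the skewness value 0.5 are illustrative; the coefficients are the n = 10 entries of the table.

```python
import math


def skewness_to_z(sqrt_b1, delta, inv_lambda):
    """Return delta * asinh(sqrt_b1 / lambda), an approximate N(0, 1) score."""
    return delta * math.asinh(sqrt_b1 * inv_lambda)


# Coefficients for n = 10 from Table A.2; 0.5 is a made-up skewness value.
z = skewness_to_z(0.5, delta=3.734, inv_lambda=0.4794)
print(round(z, 3))
```

The result is referred to the standard normal distribution, with large positive or negative values indicating skewness.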


Table A.3. Percentiles for $b_2$


Upper and lower percentiles for

$$b_2 = \frac{n\sum_{i=1}^{n}(y_i-\bar{y})^4}{\left[\sum_{i=1}^{n}(y_i-\bar{y})^2\right]^2},$$

the sample coefficient of kurtosis. Reject the hypothesis of normality if $b_2$ is greater than an
upper percentile or less than a lower percentile.
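
The kurtosis statistic can be computed the same way. The sketch below follows the definition above (n times the sum of fourth-power deviations, divided by the squared sum of squared deviations); the function name `b2` and the sample are illustrative.

```python
def b2(y):
    """Sample kurtosis b2: n * sum(d^4) / (sum(d^2))^2."""
    n = len(y)
    ybar = sum(y) / n
    ss2 = sum((v - ybar) ** 2 for v in y)   # sum of squared deviations
    ss4 = sum((v - ybar) ** 4 for v in y)   # sum of fourth-power deviations
    return n * ss4 / ss2 ** 2


print(b2([1, 2, 3, 4, 5]))   # 5 * 34 / 10**2 = 1.7
```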


                                      Percentiles
Sample
size n     1     2   2.5     5    10    20    80    90    95  97.5    98    99
   7    1.25  1.30  1.34  1.41  1.53  1.70  2.78  3.20  3.55  3.85  3.93  4.23
   8    1.31  1.37  1.40  1.46  1.58  1.75  2.84  3.31  3.70  4.09  4.20  4.53
   9    1.35  1.42  1.45  1.53  1.63  1.80  2.98  3.43  3.86  4.28  4.41  4.82
  10    1.39  1.45  1.49  1.56  1.68  1.85  3.01  3.53  3.95  4.40  4.55  5.00
  12    1.46  1.52  1.56  1.64  1.76  1.93  3.06  3.55  4.05  4.56  4.73  5.20
  15    1.55  1.61  1.64  1.72  1.84  2.01  3.13  3.62  4.13  4.66  4.85  5.30
  20    1.65  1.71  1.74  1.82  1.95  2.13  3.21  3.68  4.17  4.68  4.87  5.36
  25    1.72  1.79  1.83  1.91  2.03  2.20  3.23  3.68  4.16  4.65  4.82  5.30
  30    1.79  1.86  1.90  1.98  2.10  2.26  3.25  3.68  4.11  4.59  4.75  5.21
  35    1.84  1.91  1.95  2.03  2.14  2.31  3.27  3.68  4.10  4.53  4.68  5.13
  40    1.89  1.96  1.98  2.07  2.19  2.34  3.28  3.67  4.06  4.46  4.61  5.04
  45    1.93  2.00  2.03  2.11  2.22  2.37  3.28  3.65  4.00  4.39  4.52  4.94
  50    1.95  2.03  2.06  2.15  2.25  2.41  3.28  3.62  3.99  4.33  4.45  4.88
Table A.4. Percentiles for D’Agostino’s Test for Normality

Upper and lower percentiles for the statistic

$$Y = \frac{\sqrt{n}\,\left[D - (2\sqrt{\pi})^{-1}\right]}{.02998598},$$

where

$$D = \frac{\sum_{i=1}^{n}\left[i - \tfrac{1}{2}(n+1)\right]y_{(i)}}{\sqrt{n^3\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$

and the observations $y_1, y_2, \ldots, y_n$ are ordered as $y_{(1)} \le y_{(2)} \le \cdots \le y_{(n)}$. Reject the
hypothesis of normality if Y is greater than an upper percentile or less than a lower percentile.
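
A minimal sketch of the computation, assuming the standard form of D'Agostino's statistic: D is a weighted sum of the ordered observations, with weights i − (n + 1)/2, divided by the square root of n³ times the sum of squared deviations, and Y is the standardized version using the constant .02998598. The function name is illustrative, and the five-point sample only checks the arithmetic (the table starts at n = 10).

```python
import math


def dagostino_y(y):
    """Return (D, Y) for D'Agostino's test of normality."""
    n = len(y)
    ordered = sorted(y)                      # order statistics y_(1) <= ... <= y_(n)
    ybar = sum(ordered) / n
    ss = sum((v - ybar) ** 2 for v in ordered)
    # Weighted sum of order statistics with weights i - (n + 1) / 2.
    num = sum((i - (n + 1) / 2) * v for i, v in enumerate(ordered, start=1))
    d = num / math.sqrt(n ** 3 * ss)
    y_stat = math.sqrt(n) * (d - 1 / (2 * math.sqrt(math.pi))) / 0.02998598
    return d, y_stat


d, y_stat = dagostino_y([3, 1, 5, 2, 4])
print(round(d, 6), round(y_stat, 4))
```

Because the function sorts its input, the order in which the observations are supplied does not matter.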

                                   Percentiles of Y
   n      .5     1.0     2.5       5      10      90      95    97.5      99    99.5
  10   -4.66   -4.06   -3.25   -2.62   -1.99    .149    .235    .299    .356    .385
  12   -4.63   -4.02   -3.20   -2.58   -1.94    .237    .329    .381    .440    .479
  14   -4.57   -3.97   -3.16   -2.53   -1.90    .308    .399    .460    .515    .555
  16   -4.52   -3.92   -3.12   -2.50   -1.87    .367    .459    .526    .587    .613
  18   -4.47   -3.87   -3.08   -2.47   -1.85    .417    .515    .574    .636    .667
  20   -4.41   -3.83   -3.04   -2.44   -1.82    .460    .565    .628    .690    .720
  22   -4.36   -3.78   -3.01   -2.41   -1.81    .497    .609    .677    .744    .775
  24   -4.32   -3.75   -2.98   -2.39   -1.79    .530    .648    .720    .783    .822
  26   -4.27   -3.71   -2.96   -2.37   -1.77    .559    .682    .760    .827    .867
  28   -4.23   -3.68   -2.93   -2.35   -1.76    .586    .714    .797    .868    .910
  30   -4.19   -3.64   -2.91   -2.33   -1.75    .610    .743    .830    .906    .941
  32   -4.16   -3.61   -2.88   -2.32   -1.73    .631    .770    .862    .942    .983
  34   -4.12   -3.59   -2.86   -2.30   -1.72    .651    .794    .891    .975    1.02
  36   -4.09   -3.56   -2.85   -2.29   -1.71    .669    .816    .917    1.00    1.05
  38   -4.06   -3.54   -2.83   -2.28   -1.70    .686    .837    .941    1.03    1.08
  40   -4.03   -3.51   -2.81   -2.26   -1.70    .702    .857    .964    1.06    1.11
  42   -4.00   -3.49   -2.80   -2.25   -1.69    .716    .875    .986    1.09    1.14
  44   -3.98   -3.47   -2.78   -2.24   -1.68    .730    .892    1.01    1.11    1.17
  46   -3.95   -3.45   -2.77   -2.23   -1.67    .742    .908    1.02    1.13    1.19
  48   -3.93   -3.43   -2.75   -2.22   -1.67    .754    .923    1.04    1.15    1.22
  50   -3.91   -3.41   -2.74   -2.21   -1.66    .765    .937    1.06    1.18    1.24
  60   -3.81   -3.34   -2.68   -2.17   -1.64    .812    .997    1.13    1.26    1.34
  70   -3.73   -3.27   -2.64   -2.14   -1.61    .849    1.05    1.19    1.33    1.42
  80   -3.67   -3.22   -2.60   -2.11   -1.59    .878    1.08    1.24    1.39    1.48
  90   -3.61   -3.17   -2.57   -2.09   -1.58    .902    1.12    1.28    1.44    1.54
 100   -3.57   -3.14   -2.54   -2.07   -1.57    .923    1.14    1.31    1.48    1.59
 150  -3.409  -3.009  -2.452  -2.004  -1.520    .990   1.233   1.423   1.623   1.746
 200  -3.302  -2.922  -2.391  -1.960  -1.491   1.032   1.290   1.496   1.715   1.853
 250  -3.227  -2.861  -2.348  -1.926  -1.471   1.060   1.328   1.545   1.779   1.927
Table A.5. Upper Percentiles for $b_{1,p}$ and Upper and Lower Percentiles for $b_{2,p}$

Reject the hypothesis of multivariate normality if $b_{1,p}$ is greater than the table value. Reject if $b_{2,p}$ is greater than the upper percentile or if $b_{2,p}$ is less than the
lower percentile. The statistics $b_{1,p}$ and $b_{2,p}$ are defined in Section 4.4.2.


p = 2

Upper Percentiles for $b_{1,p}$


p = 2

Upper and Lower Percentiles for $b_{2,p}$


Percentiles  Percentiles


n 

90 

92.5 

95 

97.5 

99 

99.9 

n 

1 

2.5 

5 

10 

90 

95 

97.5 

99 

10 

2.994 

3.263 

3.694 

4.294 

5.194 

6.994 

10 

4.580 

4.722 

4.887 

5.057 

8.606 

9.203 

9.781 

10.378 

12 

2.681 

2.944 

3.319 

3.931 

4.938 

6.744 

12 

4.732 

4.899 

5.053 

5.232 

8.947 

9.593 

10.150 

10.881 

14 

2.419 

2.669 

3.031 

3.619 

4.581 

6.419 

14 

4.842 

5.015 

5.179 

5.358 

9.162 

9.769 

10.375 

11.159 

16 

2.219 

2.444 

2.775 

3.337 

4.231 

6.062 

16 

4.977 

5.149 

5.318 

5.482 

9.331 

9.941 

10.562 

11.387 

18 

2.050 

2.256 

2.556 

3.100 

3.962 

5.737 

18 

5.045 

5.219 

5.382 

5.555 

9.403 

10.005 

10.628 

11.478 

20 

1.894 

2.081 

2.356 

2.881 

3.669 

5.425 

20 

5.175 

5.262 

5.533 

5.717 

9.469 

10.114 

10.691 

11.609 

25 

1.581 

1.744 

1.969 

2.438 

3.106 

4.719 

25 

5.351 

5.525 

5.689 

5.871 

9.503 

10.159 

10.584 

11.628 

30 

1.363 

1.513 

1.687 

2.094 

2.681 

4.238 

30 

5.518 

5.692 

5.855 

6.038 

9.516 

10.156 

10.556 

11.594 

40 

1.050 

1.181 

1.319 

1.606 

2.087 

3.369 

40 

5.703 

5.871 

6.139 

6.229 

9.497 

10.109 

10.563 

11.453 

50 

.862 

.969 

1.069 

1.306 

1.744 

2.706 

50 

5.909 

6.083 

6.239 

6.403 

9.453 

9.987 

10.372 

11.181 

60 

.731 

.819 

.906 

1.094 

1.444 

2.200 

60 

6.015 

6.189 

6.335 

6.505 

9.401 

9.889 

10.250 

10.994 

70 

.631 

.725 

.794 

.937 

1.244 

1.863 

70 

6.139 

6.290 

6.437 

6.602 

9.356 

9.781 

10.106 

10.753 

80 

.544 

.637 

.694 

.812 

1.056 

1.587 

80 

6.223 

6.372 

6.539 

6.683 

9.309 

9.694 

9.981 

10.537 

90 

.487 

.569 

.638 

.725 

.919 

1.400 

90 

6.332 

6.475 

6.622 

6.749 

9.256 

9.688 

9.885 

10.325 

100 

.438 

.506 

.581 

.656 

.831 

1.231 

100 

6.389 

6.521 

6.665 

6.793 

9.210 

9.556 

9.806 

10.188 

150 

.281 

.344 

.400 

.444 

.531 

.794 

150 

6.615 

6.749 

6.858 

6.972 

9.027 

9.300 

9.475 

10.253 

200 

.219 

.269 

.300 

.331 

.394 

.569 

200 

6.761 

6.889 

6.979 

7.083 

8.919 

9.141 

9.269 

9.506 

300 

.144 

.169 

.209 

.225 

.256 

.369 

300 

6.949 

7.052 

7.142 

7.245 

8.776 

8.916 

9.031 

9.219 

400 

.116 

.129 

.141 

.166 

.197 

.275 

400 

7.079 

7.171 

7.252 

7.342 

8.664 

8.787 

8.917 

9.061 

600 

.077 

.085 

.094 

.110 

.131 

.183 

600 

7.232 

7.295 

7.369 

7.464 

8.547 

8.647 

8.749 

8.874 

Table A.5. (Continued)


p = 2

Upper Percentiles for $b_{1,p}$
Percentiles


n 

90 

92.5 

95 

97.5 

99 

99.9 

n 

800 

.058 

.064 

.071 

.083 

.099 

.137 

800 

1000 

.046 

.051 

.057 

.066 

.079 

.110 

1000 

1500 

.031 

.034 

.038 

.044 

.053 

.074 

1500 

2500 

.019 

.021 

.023 

.027 

.032 

.044 

2000 

3000 

.016 

.017 

.019 

.022 

.027 

.037 

2500 

4000 

.012 

.013 

.014 

.017 

.020 

.028 

3000 

5000 

.009 

.010 

.011 

.013 

.016 

.022 

4000 

5000 


p = 3

Upper Percentiles for $b_{1,p}$
Percentiles


n 

90 

92.5 

95 

97.5 

99 

99.9 

n 

10 

6.0 

6.5 

6.9 

7.7 

8.8 

11.5 

10 

12 

5.5 

5.9 

6.4 

7.1 

8.1 

10.5 

12 

14 

5.0 

5.4 

5.9 

6.5 

7.4 

9.7 

14 

16 

4.6 

4.9 

5.4 

6.1 

6.8 

8.9 

16 

18 

4.2 

4.6 

5.1 

5.6 

6.4 

8.3 

18 

20 

3.9 

4.2 

4.7 

5.3 

6.0 

7.7 

20 

25 

3.3 

3.5 

3.9 

4.5 

5.2 

6.5 

25 

30 

2.8 

3.0 

3.3 

3.9 

4.4 

5.6 

30 

40 

2.2 

2.4 

2.7 

3.0 

3.5 

4.2 

40 

50 

1.7 

1.9 

2.2 

2.4 

2.8 

3.4 

50 

p = 2

Upper and Lower Percentiles for $b_{2,p}$
Percentiles


1 

2.5 

5 

10 

90 

95 

97.5 

99 

7.304 

7.372 

7.451 

7.536 

8.472 

8.562 

8.641 

8.747 

7.367 

7.433 

7.504 

7.585 

8.419 

8.497 

8.569 

8.656 

7.460 

7.537 

7.595 

7.661 

8.339 

8.405 

8.463 

8.532 

7.535 

7.599 

7.649 

7.707 

8.293 

8.351 

8.401 

8.461 

7.588 

7.641 

7.686 

7.738 

8.262 

8.314 

8.359 

8.412 

7.624 

7.673 

7.714 

7.760 

8.240 

8.286 

8.327 

8.376 

7.674 

7.716 

7.752 

7.793 

8.207 

8.248 

8.284 

8.326 

7.709 

7.746 

7.778 

7.814

8.186 

8.222 

8.254 

8.291 

p = 3

Upper and Lower Percentiles for $b_{2,p}$
Percentiles


1 

2.5 

5 

10 

90 

95 

97.5 

99 

10.0 

10.2 

10.4 

10.7 

14.0 

14.4 

15.0 

15.6 

10.2 

10.4 

10.7 

11.0 

14.7 

15.2 

15.9 

16.4 

10.4 

10.6 

10.9 

11.3 

15.1 

15.8 

16.5 

17.1 

10.5 

10.8 

11.1 

11.5 

15.4 

16.1 

16.8 

17.5 

10.7 

11.0 

11.3 

11.6 

15.5 

16.4 

17.1 

17.8 

10.8 

11.1 

11.4 

11.8 

15.7 

16.5 

17.2 

18.0 

11.1 

11.4 

11.8 

12.1 

15.9 

16.7 

17.4 

18.2 

11.3 

11.6 

12.0 

12.3 

16.0 

16.7 

17.5 

18.3 

11.7 

12.0 

12.4 

12.7 

16.1 

16.7 

17.4 

18.2 

11.9 

12.3 

12.6 

12.9 

16.1 

16.7 

17.3 

18.0 

Table A.5. (Continued)


p = 3  p = 3

Upper Percentiles for $b_{1,p}$  Upper and Lower Percentiles for $b_{2,p}$

Percentiles  Percentiles


n 

90 

92.5 

95 

97.5 

99 

99.9 

n 

1 

2.5 

5 

10 

90 

95 

97.5 

99 

60 

1.5 

1.6 

1.8 

2.0 

2.4 

2.9 

60 

12.1 

12.5 

12.8 

13.1 

16.1 

16.6 

17.2 

17.9 

70 

1.3 

1.4 

1.5 

1.7 

2.0 

2.5 

70 

12.3 

12.6 

13.0 

13.2 

16.1 

16.6 

17.1 

17.7 

80 

1.13 

1.2 

1.3 

1.5 

1.7 

2.2 

80 

12.4 

12.8 

13.1 

13.3 

16.1 

16.5 

17.0 

17.6 

90 

1.01 

1.08 

1.16 

1.3 

1.5 

1.9 

90 

12.5 

12.9 

13.2 

13.5 

16.0 

16.5 

16.9 

17.5 

100 

.92 

.97 

1.05 

1.18 

1.3 

1.7 

100 

12.6 

13.0 

13.3 

13.5 

16.0 

16.4 

16.8 

17.4 

150 

.62 

.66 

.71 

.80 

.90 

1.15 

150 

13.0 

13.3 

13.6 

13.8 

15.9 

16.2 

16.5 

17.0 

200 

.47 

.50 

.54 

.60 

.68 

.87 

200 

13.2 

13.5 

13.8 

14.0 

15.8 

16.1 

16.3 

16.8 

300 

.32 

.33 

.36 

.40 

.46 

.58 

300 

13.6 

13.8 

14.0 

14.2 

15.7 

15.9 

16.1 

16.5 

400 

.237 

.252 

.272 

.30 

.34 

.44 

400 

13.7 

13.9 

14.1 

14.3 

15.6 

15.8 

16.0 

16.3 

600 

.159 

.168 

.182 

.203 

.230 

.294 

600 

13.9 

14.1 

14.3 

14.4 

15.51 

15.67 

15.81 

15.97 

800 

.119 

.127 

.137 

.153 

.173 

.221 

800 

14.1 

14.2 

14.3 

14.5 

15.45 

15.59 

15.71 

15.85 

1000 

.095 

.010 

.109 

.122 

.139 

.177 

1000 

14.17 

14.30 

14.41 

14.53 

15.41 

15.53 

15.64 

15.77 

1500 

.064 

.068 

.073 

.082 

.093 

.118 

1500 

14.33 

14.43 

14.52 

14.62 

15.34 

15.44 

15.53 

15.63 

2000 

.048 

.051 

.055 

.061 

.069 

.089 

2000 

14.42 

14.51 

14.58 

14.67 

15.30 

15.39 

15.46 

15.55 

3000 

.032 

.034 

.037 

.041 

.046 

.059 

3000 

14.53 

14.60 

14.66 

14.73 

15.25 

15.32 

15.38 

15.45 

4000 

.024 

.025 

.027 

.031 

.035 

.044 

4000 

14.59 

14.65 

14.71 

14.77 

15.21 

15.28 

15.33 

15.39 

5000 

.019 

.020 

.022 

.025 

.028 

.035 

5000 

14.63 

14.69 

14.74 

14.80 

15.19 

15.25 

15.30 

15.35 

p = 4

p = 4

Upper Percentiles for $b_{1,p}$

Upper and Lower Percentiles for $b_{2,p}$

Percentiles 

Percentiles 

n 

90 

92.5 

95 

97.5 

99 

99.9 

n 

1 

2.5 

5 

10 

90 

95 

97.5 

99 

10 

11.1 

11.6 

12.2 

13.3 

15.3 

17.9 

10 

17.0 

17.3 

17.6 

17.8 

21.5 

22.4 

23.0 

24.0 

12 

10.1 

10.6 

11.2 

12.2 

13.9 

16.2 

12 

17.4 

17.7 

18.0 

18.3 

22.3 

23.3 

24.2 

25.4 

Table A.5. (Continued)


p = 4  p = 4

Upper Percentiles for $b_{1,p}$  Upper and Lower Percentiles for $b_{2,p}$

Percentiles  Percentiles


n 

90 

92.5 

95 

97.5 

99 

99.9 

n 

1 

2.5 

5 

10 

90 

95 

97.5 

99 

14 

9.2 

9.7 

10.2 

11.2 

12.7 

14.8 

14 

17.7 

18.0 

18.3 

18.6 

23.0 

24.0 

25.0 

26.1 

16 

8.4 

8.8 

9.4 

10.3 

11.6 

13.6 

16 

18.0 

18.2 

18.6 

18.9 

23.4 

24.4 

25.4 

26.6 

18 

7.7 

8.0 

8.7 

9.5 

10.7 

12.6 

18 

18.2 

18.4 

18.8 

19.2 

23.8 

24.7 

25.8 

26.9 

20 

7.0 

7.4 

8.0 

8.8 

9.9 

11.6 

20 

18.4 

18.6 

19.0 

19.4 

24.0 

25.0 

26.1 

27.1 

25 

5.9 

6.2 

6.6 

7.1 

8.1 

9.7 

25 

18.8 

19.1 

19.5 

19.8 

24.5 

25.4 

26.4 

27.3 

30 

5.0 

5.3 

5.6 

6.0 

6.8 

8.1 

30 

19.1 

19.4 

19.8 

20.2 

24.7 

25.5 

26.6 

27.4 

40 

3.9 

4.1 

4.3 

4.6 

5.2 

6.2 

40 

19.6 

19.9 

20.3 

21.0 

25.0 

25.7 

26.7 

27.4 

50 

3.1 

3.3 

3.5 

3.8 

4.2 

5.0 

50 

20.0 

20.3 

20.6 

21.0 

25.1 

25.7 

26.6 

27.3 

60 

2.7 

2.8 

2.9 

3.2 

3.5 

4.2 

60 

20.2 

20.5 

20.9 

21.3 

25.14 

25.7 

26.6 

27.2 

70 

2.3 

2.4 

2.5 

2.8 

3.0 

3.7 

70 

20.4 

20.7 

21.0 

21.5 

25.15 

25.7 

26.5 

27.0 

80 

2.0 

2.1 

2.2 

2.4 

2.7 

3.2 

80 

20.6 

21.0 

21.2 

21.7 

25.15 

25.6 

26.4 

26.9 

90 

1.81 

1.89 

2.0 

2.2 

2.4 

2.9 

90 

20.8 

21.1 

21.4 

21.8 

25.14 

25.6 

26.3 

26.8 

100 

1.64 

1.71 

1.81 

1.97 

2.2 

2.6 

100 

20.9 

21.2 

21.5 

21.9 

25.12 

25.6 

26.2 

26.7 

150 

1.11 

1.16 

1.22 

1.33 

1.46 

1.76 

150 

21.4 

21.7 

22.0 

22.33 

25.03 

25.42 

25.9 

26.3 

200 

.84 

.87 

.92 

1.00 

1.10 

1.33 

200 

21.7 

22.0 

22.2 

22.57 

24.95 

25.29 

25.6 

26.0 

300 

.56 

.59 

.62 

.67 

.74 

.89 

300 

22.1 

22.33 

22.57 

22.85 

24.83 

25.11 

25.3 

25.7 

400 

.42 

.44 

.47 

.51 

.56 

.67 

400 

22.3 

22.56 

22.77 

23.02 

24.75 

24.99 

25.20 

25.46 

600 

.282 

.295 

.31 

.34 

.37 

.45 

600 

22.63 

22.83 

23.01 

23.21 

24.63 

24.83 

25.01 

25.21 

800 

.212 

.222 

.234 

.255 

.280 

.34 

800 

22.82 

22.99 

23.15 

23.32 

24.56 

24.74 

24.89 

25.06 

1000 

.170 

.177 

.188 

.204 

.224 

.271 

1000 

22.94 

23.10 

23.24 

23.40 

24.51 

24.67 

24.80 

24.96 

1500 

.113 

.118 

.125 

.136 

.150 

.181 

1500 

23.14 

23.27 

23.38 

23.51 

24.42 

24.55 

24.66 

24.79 

2000 

.085 

.089 

.094 

.102 

.112 

.136 

2000 

23.26 

23.37 

23.47 

23.58 

24.37 

24.48 

24.58 

24.69 

3000 

.057 

.059 

.063 

.068 

.075 

.091 

3000 

23.40 

23.49 

23.57 

23.66 

24.31 

24.40 

24.48 

24.57 

4000 

.043 

.045 

.047 

.051 

.056 

.068 

4000 

23.48 

23.56 

23.63 

23.71 

24.27 

24.35 

24.42 

24.50 

5000 

.034 

.039 

.038 

.041 

.045 

.054 

5000 

23.54 

23.61 

23.67 

23.74 

24.24 

24.31 

24.37 

24.45

Table A.6. Upper Percentiles for Test of Single Multivariate Normal Outlier


Upper percentage points for the test statistic

$$D_{(n)}^2 = \max_{1 \le i \le n}\,(\mathbf{y}_i - \bar{\mathbf{y}})'\mathbf{S}^{-1}(\mathbf{y}_i - \bar{\mathbf{y}}).$$

This tests for a single outlier in a sample of size n from a multivariate normal distribution.
Reject and conclude that the outlier is significant if $D_{(n)}^2$ exceeds the table value.
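
For the bivariate case the statistic can be computed without matrix software, since a 2 × 2 covariance matrix has a closed-form inverse. The sketch below assumes S is the sample covariance matrix with divisor n − 1; the function name and the four-point sample are illustrative only.

```python
def max_mahalanobis_2d(points):
    """Largest squared distance (y_i - ybar)' S^{-1} (y_i - ybar) for p = 2."""
    n = len(points)
    xbar = sum(x for x, _ in points) / n
    ybar = sum(y for _, y in points) / n
    # Sample variances and covariance with divisor n - 1.
    sxx = sum((x - xbar) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - ybar) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - xbar) * (y - ybar) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    best = 0.0
    for x, y in points:
        dx, dy = x - xbar, y - ybar
        # Quadratic form using the explicit inverse of the 2 x 2 matrix S.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        best = max(best, d2)
    return best


print(max_mahalanobis_2d([(0, 0), (1, 0), (0, 1), (1, 1)]))
```

For the four corners of a square every point is equally distant from the mean, so the maximum equals the common value. For p larger than 2 a general linear solver would be needed in place of the explicit inverse.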

            p = 2             p = 3             p = 4             p = 5
   n   α = .05  α = .01  α = .05  α = .01  α = .05  α = .01  α = .05  α = .01
   5     3.17     3.19
   6     4.00     4.11     4.14     4.16
   7     4.71     4.95     5.01     5.10     5.12     5.14
   8     5.32     5.70     5.77     5.97     6.01     6.09     6.11     6.12
   9     5.85     6.37     6.43     6.76     6.80     6.97     7.01     7.08
  10     6.32     6.97     7.01     7.47     7.50     7.79     7.82     7.98
  12     7.10     8.00     7.99     8.70     8.67     9.20     9.19     9.57
  14     7.74     8.84     8.78     9.71     9.61    10.37    10.29    10.90
  16     8.27     9.54     9.44    10.56    10.39    11.36    11.20    12.02
  18     8.73    10.15    10.00    11.28    11.06    12.20    11.96    12.98
  20     9.13    10.67    10.49    11.91    11.63    12.93    12.62    13.81
  25     9.94    11.73    11.48    13.18    12.78    14.40    13.94    15.47
  30    10.58    12.54    12.24    14.14    13.67    15.51    14.95    16.73
  35    11.10    13.20    12.85    14.92    14.37    16.40    15.75    17.73
  40    11.53    13.74    13.36    15.56    14.96    17.13    16.41    18.55
  45    11.90    14.20    13.80    16.10    15.46    17.74    16.97    19.24
  50    12.23    14.60    14.18    16.56    15.89    18.27    17.45    19.83
 100    14.22    16.95    16.45    19.26    18.43    21.30    20.26    23.17
 200    15.99    18.94    18.42    21.47    20.59    23.72    22.59    25.82
 500    18.12    21.22    20.75    23.95    23.06    26.37    25.21    28.62
Table A.7. Upper Percentage Points of Hotelling’s $T^2$ Distribution


Degrees of Freedom, ν

p = 1

p = 2

p = 3

p = 4

p = 5

p = 6

p = 7

p = 8

p = 9

p = 10

α = .05

2 

18.513 

3 

10.128 

57.000 

4 

7.709 

25.472 

114.986 

5 

6.608 

17.361 

46.383 

192.468 

6 

5.987 

13.887 

29.661 

72.937 

289.446 

7 

5.591 

12.001 

22.720 

44.718 

105.157 

405.920 

8 

5.318 

10.828 

19.028 

33.230 

62.561 

143.050 

541.890 

9 

5.117 

10.033 

16.766 

27.202 

45.453 

83.202 

186.622 

697.356 

10 

4.965 

9.459 

15.248 

23.545 

36.561 

59.403 

106.649 

235.873 

872.317 

11 

4.844 

9.026 

14.163 

21.108 

31.205 

47.123 

75.088 

132.903 

290.806 

1066.774 

12 

4.747 

8.689 

13.350 

19.376 

27.656 

39.764 

58.893 

92.512 

161.967 

351.421 

13 

4.667 

8.418 

12.719 

18.086 

25.145 

34.911 

49.232 

71.878 

111.676 

193.842 

14 

4.600 

8.197 

12.216 

17.089 

23.281 

31.488 

42.881 

59.612 

86.079 

132.582 

15 

4.543 

8.012 

11.806 

16.296 

21.845 

28.955 

38.415 

51.572 

70.907 

101.499 

16 

4.494 

7.856 

11.465 

15.651 

20.706 

27.008 

35.117 

45.932 

60.986 

83.121 

17 

4.451 

7.722 

11.177 

15.117 

19.782 

25.467 

32.588 

41.775 

54.041 

71.127 

18 

4.414 

7.606 

10.931 

14.667 

19.017 

24.219 

30.590 

38.592 

48.930 

62.746 

19 

4.381 

7.504 

10.719 

14.283 

18.375 

23.189 

28.975 

36.082 

45.023 

56.587 

20 

4.351 

7.415 

10.533 

13.952 

17.828 

22.324 

27.642 

34.054 

41.946 

51.884 

21 

4.325 

7.335 

10.370 

13.663 

17.356 

21.588 

26.525 

32.384 

39.463 

48.184 

22 

4.301 

7.264 

10.225 

13.409 

16.945 

20.954 

25.576 

30.985 

37.419 

45.202 

23 

4.279 

7.200 

10.095 

13.184 

16.585 

20.403 

24.759 

29.798 

35.709 

42.750 

24 

4.260 

7.142 

9.979 

12.983 

16.265 

19.920 

24.049 

28.777 

34.258 

40.699 

25 

4.242 

7.089 

9.874 

12.803 

15.981 

19.492 

23.427 

27.891 

33.013 

38.961 

26 

4.225 

7.041 

9.779 

12.641 

15.726 

19.112 

22.878 

27.114 

31.932 

37.469


27 

4.210 

6.997 

9.692 

12.493 

15.496 

18.770 

22.388 

26.428 

30.985 

36.176 

28 

4.196 

6.957 

9.612 

12.359 

15.287 

18.463 

21.950 

25.818 

30.149 

35.043 

29 

4.183 

6.919 

9.539 

12.236 

15.097 

18.184 

21.555 

25.272 

29.407 

34.044 

30 

4.171 

6.885 

9.471 

12.123 

14.924 

17.931 

21.198 

24.781 

28.742 

33.156 

35 

4.121 

6.744 

9.200 

11.674 

14.240 

16.944 

19.823 

22.913 

26.252 

29.881 

40 

4.085 

6.642 

9.005 

11.356 

13.762 

16.264 

18.890 

21.668 

24.624 

27.783 

45 

4.057 

6.564 

8.859 

11.118 

13.409 

15.767 

18.217 

20.781 

23.477 

26.326 

50 

4.034 

6.503 

8.744 

10.934 

13.138 

15.388 

17.709 

20.117 

22.627 

25.256 

55 

4.016 

6.454 

8.652 

10.787 

12.923 

15.090 

17.311 

19.600 

21.972 

24.437 

60 

4.001 

6.413 

8.577 

10.668 

12.748 

14.850 

16.992 

19.188 

21.451 

23.790 

70 

3.978 

6.350 

8.460 

10.484 

12.482 

14.485 

16.510 

18.571 

20.676 

22.834 

80 

3.960 

6.303 

8.375 

10.350 

12.289 

14.222 

16.165 

18.130 

20.127 

22.162 

90 

3.947 

6.267 

8.309 

10.248 

12.142 

14.022 

15.905 

17.801 

19.718 

21.663 

100 

3.936 

6.239 

8.257 

10.167 

12.027 

13.867 

15.702 

17.544 

19.401 

21.279 

110 

3.927 

6.216 

8.215 

10.102 

11.934 

13.741 

15.540 

17.340 

19.149 

20.973 

120 

3.920 

6.196 

8.181 

10.048 

11.858 

13.639 

15.407 

17.172 

18.943 

20.725 

150 

3.904 

6.155 

8.105 

9.931 

11.693 

13.417 

15.121 

16.814 

18.504 

20.196 

200 

3.888 

6.113 

8.031 

9.817 

11.531 

13.202 

14.845 

16.469 

18.083 

19.692 

400 

3.865 

6.052 

7.922 

9.650 

11.297 

12.890 

14.447 

15.975 

17.484 

18.976 

1000 

3.851 

6.015 

7.857 

9.552 

11.160 

12.710 

14.217 

15.692 

17.141 

18.570 

∞

3.841 

5.991 

7.815 

9.488 

11.070 

12.592 

14.067 

15.507 

16.919 

18.307 

Table A.7. (Continued)


Degrees of Freedom, ν

p = 1

p = 2

p = 3

p = 4

p = 5

p = 6

p = 7

p = 8

p = 9

p = 10

α = .01

2 

98.503 

3 

34.116 

297.000 

4 

21.198 

82.177 

594.997 

5 

16.258 

45.000 

147.283 

992.494 

6 

13.745 

31.857 

75.125 

229.679 

1489.489 

7 

12.246 

25.491 

50.652 

111.839 

329.433 

2085.984 

8 

11.259 

21.821 

39.118 

72.908 

155.219 

446.571 

2781.978 

9 

10.561 

19.460 

32.598 

54.890 

98.703 

205.293 

581.106 

3577.472 

10 

10.044 

17.826 

28.466 

44.838 

72.882 

128.067 

262.076 

733.045 

4472.464 

11 

9.646 

16.631 

25.637 

38.533 

58.618 

93.127 

161.015 

325.576 

902.392 

5466.956 

12 

9.330 

15.722 

23.588 

34.251 

49.739 

73.969 

115.640 

197.555 

395.797 

1089.149 

13 

9.074 

15.008 

22.041 

31.171 

43.745 

62.114 

90.907 

140.429 

237.692 

472.742 

14 

8.862 

14.433 

20.834 

28.857 

39.454 

54.150 

75.676 

109.441 

167.499 

281.428 

15 

8.683 

13.960 

19.867 

27.060 

36.246 

48.472 

65.483 

90.433 

129.576 

196.853 

16 

8.531 

13.566 

19.076 

25.626 

33.672 

44.240 

58.241 

77.755 

106.391 

151.316 

17 

8.400 

13.231 

18.418 

24.458 

31.788 

40.975 

52.858 

68.771 

90.969 

123.554 

18 

8.285 

12.943 

17.861 

23.487 

30.182 

38.385 

48.715 

62.109 

80.067 

105.131 

19 

8.185 

12.694 

17.385 

22.670 

28.852 

36.283 

45.435 

56.992 

71.999 

92.134 

20 

8.096 

12.476 

16.973 

21.972 

27.734 

34.546 

42.779 

52.948 

65.813 

82.532 

21 

8.017 

12.283 

16.613 

21.369 

26.781 

33.088 

40.587 

49.679 

60.932 

75.181 

22 

7.945 

12.111 

16.296 

20.843 

25.959 

31.847 

38.750 

46.986 

56.991 

69.389 

23 

7.881 

11.958 

16.015 

20.381 

25.244 

30.779 

37.188 

44.730 

53.748 

64.719 

24 

7.823 

11.820 

15.763 

19.972 

24.616 

29.850 

35.846 

42.816 

51.036 

60.879 

25 

7.770 

11.695 

15.538 

19.606 

24.060 

29.036 

34.680 

41.171 

48.736 

57.671 

26 

7.721 

11.581 

15.334 

19.279 

23.565 

28.316 

33.659 

39.745 

46.762 

54.953 

27 

7.677 

11.478 

15.149 

18.983 

23.121 

27.675 

32.756 

38.496 

45.051 

52.622 



28 

7.636 

11.383 

14.980 

18.715 

22.721 

27.101 

31.954 

37.393 

43.554 

50.604 

29 

7.598 

11.295 

14.825 

18.471 

22.359 

26.584 

31.236 

36.414 

42.234 

48.839 

30 

7.562 

11.215 

14.683 

18.247 

22.029 

26.116 

30.589 

35.538 

41.062 

47.283 

35 

7.419 

10.890 

14.117 

17.366 

20.743 

24.314 

28.135 

32.259 

36.743 

41.651 

40 

7.314 

10.655 

13.715 

16.750 

19.858 

23.094 

26.502 

30.120 

33.984 

38.135 

45 

7.234 

10.478 

13.414 

16.295 

19.211 

22.214 

25.340 

28.617 

32.073 

35.737 

50 

7.171 

10.340 

13.181 

15.945 

18.718 

21.550 

24.470 

27.504 

30.673 

33.998 

55 

7.119 

10.228 

12.995 

15.667 

18.331 

21.030 

23.795 

26.647 

29.603 

32.682 

60 

7.077 

10.137 

12.843 

15.442 

18.018 

20.613 

23.257 

25.967 

28.760 

31.650 

70 

7.011 

9.996 

12.611 

15.098 

17.543 

19.986 

22.451 

24.957 

27.515 

30.139 

80 

6.963 

9.892 

12.440 

14.849 

17.201 

19.536 

21.877 

24.242 

26.642 

29.085 

90 

6.925 

9.813 

12.310 

14.660 

16.942 

19.197 

21.448 

23.710 

25.995 

28.310 

100 

6.895 

9.750 

12.208 

14.511 

16.740 

18.934 

21.115 

23.299 

25.496 

27.714 

110 

6.871 

9.699 

12.125 

14.391 

16.577 

18.722 

20.849 

22.972 

25.101 

27.243 

120 

6.851 

9.657 

12.057 

14.292 

16.444 

18.549 

20.632 

22.705 

24.779 

26.862 

150 

6.807 

9.565 

11.909 

14.079 

16.156 

18.178 

20.167 

22.137 

24.096 

26.054 

200 

6.763 

9.474 

11.764 

13.871 

15.877 

17.819 

19.720 

21.592 

23.446 

25.287 

400 

6.699 

9.341 

11.551 

13.569 

15.473 

17.303 

19.080 

20.818 

22.525 

24.209 

1000 

6.660 

9.262 

11.426 

13.392 

15.239 

17.006 

18.743 

20.376 

22.003 

23.600 

∞

6.635 

9.210 

11.345 

13.277 

15.086 

16.812 

18.475 

20.090 

21.666 

23.209 

Note:  p  =  number  of  variables. 
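
Entries of Table A.7 can be checked against F percentage points through the relation T² = [νp/(ν − p + 1)] F, where F has p and ν − p + 1 degrees of freedom. The sketch below is a hedged illustration: the function names are made up, and the closed-form F quantile used here holds only when the numerator degrees of freedom equal 2, so it covers only the p = 2 column. For other values of p an F table or statistical library would supply the F point.

```python
def t2_from_f(f_crit, p, nu):
    """Convert an upper F point with (p, nu - p + 1) df to a T^2 point."""
    return nu * p / (nu - p + 1) * f_crit


def f_upper_2_m(alpha, m):
    """Upper alpha point of F with (2, m) df, from P(F > x) = (1 + 2x/m)^(-m/2)."""
    return (m / 2) * (alpha ** (-2 / m) - 1)


# Reproduce a few p = 2, alpha = .05 entries of Table A.7.
for nu in (3, 4, 10):
    f_crit = f_upper_2_m(0.05, nu - 1)
    print(nu, round(t2_from_f(f_crit, 2, nu), 3))
```

These agree with the tabled values 57.000, 25.472, and 9.459 for ν = 3, 4, and 10. In the same way, the p = 1 column is the ordinary F point with 1 and ν degrees of freedom.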
Table A.8. Bonferroni t-Values, $t_{\alpha/2k,\nu}$, α = .05

k

1  2  3  4  5  6  7  8  9  10

100α/k

ν  5.0000  2.5000  1.6667  1.2500  1.0000  .8333  .7143  .6250  .5556  .5000

2  4.3027  6.2053  7.6488  8.8602  9.9248  10.8859  11.7687  12.5897  13.3604  14.0890

3  3.1824  4.1765  4.8567  5.3919  5.8409  6.2315  6.5797  6.8952  7.1849  7.4533

4  2.7764  3.4954  3.9608  4.3147  4.6041  4.8510  5.0675  5.2611  5.4366  5.5976

5  2.5706  3.1634  3.5341  3.8100  4.0321  4.2193  4.3818  4.5257  4.6553  4.7733

6  2.4469  2.9687  3.2875  3.5212  3.7074  3.8630  3.9971  4.1152  4.2209  4.3168

7  2.3646  2.8412  3.1276  3.3353  3.4995  3.6358  3.7527  3.8552  3.9467  4.0293

8  2.3060  2.7515  3.0158  3.2060  3.3554  3.4789  3.5844  3.6766  3.7586  3.8325

9  2.2622  2.6850  2.9333  3.1109  3.2498  3.3642  3.4616  3.5465  3.6219  3.6897

10  2.2281  2.6338  2.8701  3.0382  3.1693  3.2768  3.3682  3.4477  3.5182  3.5814

11  2.2010  2.5931  2.8200  2.9809  3.1058  3.2081  3.2949  3.3702  3.4368  3.4966

12  2.1788  2.5600  2.7795  2.9345  3.0545  3.1527  3.2357  3.3078  3.3714  3.4284

13  2.1604  2.5326  2.7459  2.8961  3.0123  3.1070  3.1871  3.2565  3.3177  3.3725

14  2.1448  2.5096  2.7178  2.8640  2.9768  3.0688  3.1464  3.2135  3.2727  3.3257

15  2.1314  2.4899  2.6937  2.8366  2.9467  3.0363  3.1118  3.1771  3.2346  3.2860

16  2.1199  2.4729  2.6730  2.8131  2.9208  3.0083  3.0821  3.1458  3.2019  3.2520

17  2.1098  2.4581  2.6550  2.7925  2.8982  2.9840  3.0563  3.1186  3.1735  3.2224

18  2.1009  2.4450  2.6391  2.7745  2.8784  2.9627  3.0336  3.0948  3.1486  3.1966

19  2.0930  2.4334  2.6251  2.7586  2.8609  2.9439  3.0136  3.0738  3.1266  3.1737

20  2.0860  2.4231  2.6126  2.7444  2.8453  2.9271  2.9958  3.0550  3.1070  3.1534

21  2.0796  2.4138  2.6013  2.7316  2.8314  2.9121  2.9799  3.0382  3.0895  3.1352

22  2.0739  2.4055  2.5912  2.7201  2.8188  2.8985  2.9655  3.0231  3.0737  3.1188

23  2.0687  2.3979  2.5820  2.7097  2.8073  2.8863  2.9525  3.0095  3.0595  3.1040

24  2.0639  2.3909  2.5736  2.7002  2.7969  2.8751  2.9406  2.9970  3.0465  3.0905

25  2.0595  2.3846  2.5660  2.6916  2.7874  2.8649  2.9298  2.9856  3.0346  3.0782

26  2.0555  2.3788  2.5589  2.6836  2.7787  2.8555  2.9199  2.9752  3.0237  3.0669
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Table  A.8.  ( Continued ) 

k 

123456789  10 

100 a/k 

v  5.0000  2.5000  1.6667  1.2500  1.0000  .8333  .7143  .6250  .5556  .5000 

27  2.0518  2.3734  2.5525  2.6763  2.7707  2.8469  2.9107  2.9656  3.0137  3.0565 

28  2.0484  2.3685  2.5465  2.6695  2.7633  2.8389  2.9023  2.9567  3.0045  3.0469 

29  2.0452  2.3638  2.5409  2.6632  2.7564  2.8316  2.8945  2.9485  2.9959  3.0380 

30  2.0423  2.3596  2.5357  2.6574  2.7500  2.8247  2.8872  2.9409  2.9880  3.0298 

35  2.0301  2.3420  2.5145  2.6334  2.7238  2.7966  2.8575  2.9097  2.9554  2.9960 

40  2.0211  2.3289  2.4989  2.6157  2.7045  2.7759  2.8355  2.8867  2.9314  2.9712 

45  2.0141  2.3189  2.4868  2.6021  2.6896  2.7599  2.8187  2.8690  2.9130  2.9521 

50  2.0086  2.3109  2.4772  2.5913  2.6778  2.7473  2.8053  2.8550  2.8984  2.9370 

55  2.0040  2.3044  2.4694  2.5825  2.6682  2.7370  2.7944  2.8436  2.8866  2.9247 

60  2.0003  2.2990  2.4630  2.5752  2.6603  2.7286  2.7855  2.8342  2.8768  2.9146 

70  1.9944  2.2906  2.4529  2.5639  2.6479  2.7153  2.7715  2.8195  2.8615  2.8987 

80  1.9901  2.2844  2.4454  2.5554  2.6387  2.7054  2.7610  2.8086  2.8502  2.8870 

90  1.9867  2.2795  2.4395  2.5489  2.6316  2.6978  2.7530  2.8002  2.8414  2.8779 

100  1.9840  2.2757  2.4349  2.5437  2.6259  2.6918  2.7466  2.7935  2.8344  2.8707 

110  1.9818  2.2725  2.4311  2.5394  2.6213  2.6868  2.7414  2.7880  2.8287  2.8648 

120  1.9799  2.2699  2.4280  2.5359  2.6174  2.6827  2.7370  2.7835  2.8240  2.8599 

250  1.9695  2.2550  2.4102  2.5159  2.5956  2.6594  2.7124  2.7577  2.7972  2.8322 

500  1.9647  2.2482  2.4021  2.5068  2.5857  2.6488  2.7012  2.7460  2.7850  2.8195 

1000  1.9623  2.2448  2.3980  2.5022  2.5808  2.6435  2.6957  2.7402  2.7790  2.8133 

∞  1.9600  2.2414  2.3940  2.4977  2.5758  2.6383  2.6901  2.7344  2.7729  2.8070 

(continued) 

k

11  12  13  14  15  16  17  18  19

100α/k

ν  .4545  .4167  .3846  .3571  .3333  .3125  .2941  .2778  .2632

2  14.7818  15.4435  16.0780  16.6883  17.2772  17.8466  18.3984  18.9341  19.4551

3  7.7041  7.9398  8.1625  8.3738  8.5752  8.7676  8.9521  9.1294  9.3001

4  5.7465  5.8853  6.0154  6.1380  6.2541  6.3643  6.4693  6.5697  6.6659

5  4.8819  4.9825  5.0764  5.1644  5.2474  5.3259  5.4005  5.4715  5.5393

6  4.4047  4.4858  4.5612  4.6317  4.6979  4.7604  4.8196  4.8759  4.9295

7  4.1048  4.1743  4.2388  4.2989  4.3553  4.4084  4.4586  4.5062  4.5514

8  3.8999  3.9618  4.0191  4.0724  4.1224  4.1693  4.2137  4.2556  4.2955

9  3.7513  3.8079  3.8602  3.9088  3.9542  3.9969  4.0371  4.0752  4.1114

10  3.6388  3.6915  3.7401  3.7852  3.8273  3.8669  3.9041  3.9394  3.9728

11  3.5508  3.6004  3.6462  3.6887  3.7283  3.7654  3.8004  3.8335  3.8648

12  3.4801  3.5274  3.5709  3.6112  3.6489  3.6842  3.7173  3.7487  3.7783

13  3.4221  3.4674  3.5091  3.5478  3.5838  3.6176  3.6493  3.6793  3.7076

14  3.3736  3.4173  3.4576  3.4949  3.5296  3.5621  3.5926  3.6214  3.6487

15  3.3325  3.3749  3.4139  3.4501  3.4837  3.5151  3.5447  3.5725  3.5989

16  3.2973  3.3386  3.3765  3.4116  3.4443  3.4749  3.5036  3.5306  3.5562

17  3.2667  3.3070  3.3440  3.3783  3.4102  3.4400  3.4680  3.4944  3.5193

18  3.2399  3.2794  3.3156  3.3492  3.3804  3.4095  3.4369  3.4626  3.4870

19  3.2163  3.2550  3.2906  3.3235  3.3540  3.3826  3.4094  3.4347  3.4585

20  3.1952  3.2333  3.2683  3.3006  3.3306  3.3587  3.3850  3.4098  3.4332

21  3.1764  3.2139  3.2483  3.2802  3.3097  3.3373  3.3632  3.3876  3.4106

22  3.1595  3.1965  3.2304  3.2618  3.2909  3.3181  3.3436  3.3676  3.3903

23  3.1441  3.1807  3.2142  3.2451  3.2739  3.3007  3.3259  3.3495  3.3719

24  3.1302  3.1663  3.1994  3.2300  3.2584  3.2849  3.3097  3.3331  3.3552

25  3.1175  3.1532  3.1859  3.2162  3.2443  3.2705  3.2950  3.3181  3.3400

26  3.1058  3.1412  3.1736  3.2035  3.2313  3.2572  3.2815  3.3044  3.3260

27  3.0951  3.1301  3.1622  3.1919  3.2194  3.2451  3.2691  3.2918  3.3132

28  3.0852  3.1199  3.1517  3.1811  3.2084  3.2339  3.2577  3.2801  3.3013

29  3.0760  3.1105  3.1420  3.1712  3.1982  3.2235  3.2471  3.2694  3.2904

30  3.0675  3.1017  3.1330  3.1620  3.1888  3.2138  3.2373  3.2594  3.2802

35  3.0326  3.0658  3.0962  3.1242  3.1502  3.1744  3.1971  3.2185  3.2386

40  3.0069  3.0393  3.0690  3.0964  3.1218  3.1455  3.1676  3.1884  3.2081

45  2.9872  3.0191  3.0482  3.0751  3.1000  3.1232  3.1450  3.1654  3.1846

50  2.9716  3.0030  3.0318  3.0582  3.0828  3.1057  3.1271  3.1472  3.1661

55  2.9589  2.9900  3.0184  3.0446  3.0688  3.0914  3.1125  3.1324  3.1511

60  2.9485  2.9792  3.0074  3.0333  3.0573  3.0796  3.1005  3.1202  3.1387

70  2.9321  2.9624  2.9901  3.0156  3.0393  3.0613  3.0818  3.1012  3.1194

80  2.9200  2.9500  2.9773  3.0026  3.0259  3.0476  3.0679  3.0870  3.1050

90  2.9106  2.9403  2.9675  2.9924  3.0156  3.0371  3.0572  3.0761  3.0939

100  2.9032  2.9327  2.9596  2.9844  3.0073  3.0287  3.0487  3.0674  3.0851

110  2.8971  2.9264  2.9532  2.9778  3.0007  3.0219  3.0417  3.0604  3.0779

120  2.8921  2.9212  2.9479  2.9724  2.9951  3.0162  3.0360  3.0545  3.0720

250  2.8635  2.8919  2.9178  2.9416  2.9637  2.9842  3.0034  3.0213  3.0383

500  2.8505  2.8785  2.9041  2.9276  2.9494  2.9696  2.9885  3.0063  3.0230

1000  2.8440  2.8719  2.8973  2.9207  2.9423  2.9624  2.9812  2.9988  3.0154

∞  2.8376  2.8653  2.8905  2.9137  2.9352  2.9552  2.9738  2.9913  3.0078
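
The entries of Table A.8 are upper-tail quantiles of Student's t distribution at probability α/(2k) with ν degrees of freedom, so individual values can be recomputed as a cross-check. The following is a minimal sketch using only the Python standard library. The function names (`t_pdf`, `t_upper_tail`, `bonferroni_t`) are illustrative and do not come from the text, and Simpson's rule with bisection is an assumed stand-in for a library quantile routine.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_upper_tail(x, df, n=4000):
    """P(T > x) for x >= 0, integrating the density on [0, x] by Simpson's rule.

    n must be even.
    """
    h = x / n
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


def bonferroni_t(k, df, alpha=0.05):
    """Critical value t_{alpha/(2k), df} for k simultaneous two-sided tests."""
    target = alpha / (2 * k)
    lo, hi = 0.0, 1.0
    # Expand the bracket until the upper tail at hi drops below the target.
    while t_upper_tail(hi, df) > target:
        lo, hi = hi, hi * 2
    # Bisection: the upper tail is decreasing in x.
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_upper_tail(mid, df) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Example: k = 1, nu = 2; the corresponding table entry is 4.3027.
print(round(bonferroni_t(1, 2), 4))
```

The ∞ row of the table corresponds to standard normal quantiles and is not covered by this sketch.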

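Table A.9. Lower Critical Values of Wilks' $\Lambda$, $\alpha = .05$


$$\Lambda = \frac{|\mathbf{E}|}{|\mathbf{E} + \mathbf{H}|} = \prod_{i=1}^{s} \frac{1}{1 + \lambda_i},$$

where $\lambda_1, \lambda_2, \ldots, \lambda_s$ are the eigenvalues of $\mathbf{E}^{-1}\mathbf{H}$. Reject $H_0$ if $\Lambda <$ table value. ᵃ Multiply entry by $10^{-3}$.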
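
As a small illustration of the definition above, the sketch below computes Wilks' Λ as the determinant ratio |E| / |E + H| and compares it with a critical value read from the table. It uses only the Python standard library. The helper names (`det`, `wilks_lambda`, `reject_h0`), the example matrices, and the degrees of freedom are hypothetical and chosen only for illustration.

```python
def det(m):
    """Determinant of a square matrix by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]  # work on a copy
    n = len(a)
    d = 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if a[p][c] == 0:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d  # a row swap flips the sign
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for j in range(c, n):
                a[r][j] -= f * a[c][j]
    return d


def wilks_lambda(E, H):
    """Wilks' Lambda = |E| / |E + H| for error matrix E and hypothesis matrix H."""
    total = [[e + h for e, h in zip(e_row, h_row)] for e_row, h_row in zip(E, H)]
    return det(E) / det(total)


def reject_h0(lam, critical):
    """Rejection rule stated with the table: reject H0 if Lambda < table value."""
    return lam < critical


# Hypothetical 2 x 2 example (p = 2); these matrices are made up for illustration.
E = [[4.0, 1.0], [1.0, 3.0]]
H = [[2.0, 0.5], [0.5, 1.0]]
lam = wilks_lambda(E, H)  # 11 / 21.75

# Assumed degrees of freedom nu_H = 2, nu_E = 10; the p = 2 panel lists .367.
print(lam, reject_h0(lam, 0.367))
```

When E and H are both diagonal, the eigenvalues of E⁻¹H are simply the ratios of corresponding diagonal entries, so the product form ∏ 1/(1 + λᵢ) can be checked by hand against the determinant ratio.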


$\nu_H$


$\nu_E$

1 

2 

3 

4 

5 

6 

7 

8 

9 

10 

11 

12 

p = 1

1

6.16ᵃ

2.50ᵃ

1.54ᵃ

1.11ᵃ

.868ᵃ

.712ᵃ

.603ᵃ

.523ᵃ

.462ᵃ

.413ᵃ

.374ᵃ

.341ᵃ

2 

.098 

.050 

.034 

.025 

.020 

.017 

.015 

.013 

.011 

.010 

9.28" 

8.51" 

3 

.229 

.136 

.097 

.076 

.062 

.053 

.046 

.041 

.036 

.033 

.030 

.028 

4 

.342 

.224 

.168 

.135 

.113 

.098 

.086 

.076 

.069 

.063 

.058 

.053 

5 

.431 

.302 

.236 

.194 

.165 

.144 

.128 

.115 

.104 

.096 

.088 

.082 

6 

.501 

.368 

.296 

.249 

.215 

.189 

.169 

.153 

.140 

.129 

.119 

.111 

7 

.556 

.425 

.349 

.298 

.261 

.232 

.209 

.190 

.175 

.161 

.150 

.140 

8 

.601 

.473 

.396 

.343 

.303 

.271 

.246 

.225 

.208 

.193 

.180 

.169 

9 

.638 

.514 

.437 

.382 

.341 

.308 

.281 

.258 

.239 

.223 

.209 

.196 

10 

.668 

.549 

.473 

.418 

.376 

.341 

.313 

.289 

.269 

.251 

.236 

.222 

11 

.694 

.580 

.505 

.450 

.407 

.372 

.343 

.318 

.297 

.278 

.262 

.247 

12 

.717 

.607 

.534 

.479 

.436 

.400 

.370 

.345 

.323 

.304 

.286 

.271 

13 

.736 

.631 

.560 

.506 

.462 

.426 

.396 

.370 

.347 

.327 

.310 

.294 

14 

.753 

.652 

.583 

.529 

.486 

.450 

.420 

.393 

.370 

.350 

.332 

.315 

15 

.768 

.671 

.603 

.551 

.508 

.473 

.442 

.415 

.392 

.371 

.352 

.336 

16 

.781 

.688 

.622 

.571 

.529 

.493 

.462 

.436 

.412 

.391 

.372 

.355 

17 

.792 

.703 

.639 

.589 

.548 

.512 

.482 

.455 

.431 

.410 

.390 

.373 

18 

.803 

.717 

.655 

.606 

.565 

.530 

.499 

.473 

.449 

.427 

.408 

.390 

19 

.813 

.730 

.669 

.621 

.581 

.546 

.516 

.490 

.466 

.444 

.425 

.407 

20 

.821 

.741 

.683 

.636 

.596 

.562 

.532 

.505 

.482 

.460 

.440 

.423 

21 

.829 

.752 

.695 

.649 

.610 

.576 

.547 

.520 

.497 

.475 

.455 

.437 

22 

.836 

.762 

.706 

.661 

.623 

.590 

.561 

.534 

.511 

.489 

.470 

.452 

23 

.843 

.771 

.717 

.673 

.635 

.603 

.574 

.548 

.524 

.503 

.483 

.465 

24 

.849 

.779 

.727 

.684 

.647 

.615 

.586 

.560 

.537 

.516 

.496 

.478 

25 

.855 

.787 

.736 

.694 

.658 

.626 

.598 

.572 

.549 

.528 

.508 

.490 

26 

.860 

.794 

.744 

.703 

.668 

.637 

.609 

.583 

.560 

.539 

.520 

.502 

27 

.865 

.801 

.752 

.712 

.677 

.647 

.619 

.594 

.571 

.551 

.531 

.513 

28 

.870 

.807 

.760 

.721 

.686 

.656 

.629 

.604 

.582 

.561 

.542 

.524 

29 

.874 

.813 

.767 

.729 

.695 

.665 

.638 

.614 

.592 

.571 

.552 

.535 

30 

.878 

.819 

.774 

.736 

.703 

.674 

.647 

.623 

.601 

.581 

.562 

.544 

40 

.907 

.861 

.824 

.793 

.766 

.741 

.718 

.696 

.677 

.658 

.641 

.625 

60 

.938 

.905 

.879 

.856 

.835 

.816 

.798 

.781 

.766 

.751 

.736 

.723 

80 

.953 

.928 

.907 

.889 

.873 

.858 

.843 

.829 

.816 

.804 

.792 

.780 

100 

.962 

.942 

.925 

.910 

.897 

.884 

.872 

.860 

.849 

.838 

.828 

.818 

120 

.968 

.951 

.937 

.925 

.913 

.902 

.891 

.882 

.872 

.863 

.854 

.845 

140 

.973 

.958 

.946 

.935 

.925 

.915 

.906 

.897 

.889 

.881 

.873 

.865 

170 

.978 

.965 

.955 

.946 

.937 

.929 

.922 

.914 

.907 

.900 

.893 

.887 

200 

.981 

.970 

.962 

.954 

.947 

.940 

.933 

.926 

.920 

.914 

.908 

.902 

240 

.984 

.975 

.968 

.961 

.955 

.949 

.944 

.938 

.933 

.928 

.923 

.918 

320 

.988 

.981 

.976 

.971 

.966 

.962 

.957 

.953 

.949 

.945 

.941 

.937 

440 

.991 

.986 

.982 

.979 

.975 

.972 

.969 

.966 

.963 

.960 

.957 

.954 

600 

.994 

.990 

.987 

.984 

.982 

.979 

.977 

.975 

.972 

.970 

.968 

.966 

800 

.995 

.993 

.990 

.988 

.986 

.984 

.983 

.981 

.979 

.977 

.976 

.974 

1000 

.996 

.994 

.992 

.991 

.989 

.988 

.986 

.985 

.983 

.982 

.981 

.979 

(continued)
Table  A.9.  (Continued)


$\nu_H$


$\nu_E$

1 

2 

3 

4 

5 

6 

7 

8 

9 

10 

11 

12 

p = 2

1

.000

.000

.000

.000

.000

.000

.000 

.000 

.000 

.000 

.000 

.000 

2 

2.50" 

.641“ 

.287“ 

.162“ 

.104“ 

.072" 

.053" 

.041" 

.032" 

.026" 

.022" 

.018' 

3 

.050 

.018 

9.53“ 

5.84" 

3.95" 

2.85" 

2.15" 

1.68" 

1.35" 

1.11" 

.928" 

.787' 

4 

.136 

.062 

.036 

.023 

.017 

.012 

9.56" 

7.62" 

6.21" 

5.17" 

4.36" 

3.73" 

5 

.224 

.117 

.074 

.051 

.037 

.028 

.023 

.018 

.015 

.013 

.011 

.009 

6 

.302 

.175 

.116 

.084 

.063 

.049 

.040 

.033 

.027 

.023 

.020 

.017 

7 

.368 

.230 

.160 

.119 

.092 

.074 

.060 

.050 

.042 

.036 

.032 

.028 

8 

.4256 

.280 

.203 

.155 

.122 

.099 

.082 

.069 

.059 

.051 

.045 

.040 

9 

.473 

.326 

.243 

.190 

.153 

.126 

.106 

.090 

.078 

.068 

.060 

.053 

10 

.514 

.367 

.281 

.223 

.183 

.152 

.129 

.111 

.097 

.085 

.075 

.067 

11 

.549 

.404 

.316 

.255 

.212 

.179 

.153 

.133 

.116 

.102 

.091 

.082 

12 

.580 

.437 

.348 

.286 

.240 

.204 

.176 

.154 

.136 

.120 

.108 

.097 

13 

.607 

.467 

.378 

.314 

.266 

.229 

.199 

.175 

.155 

.138 

.124 

.112 

14 

.631 

.495 

.405 

.340 

.291 

.252 

.221 

.195 

.174 

.156 

.141 

.128 

15 

.652 

.519 

.431 

.365 

.315 

.275 

.242 

.215 

.193 

.174 

.157 

.143 

16 

.671 

.542 

.454 

.389 

.337 

.296 

.263 

.235 

.211 

.191 

.174 

.159 

17 

.688 

.562 

.476 

.410 

.359 

.317 

.282 

.254 

.229 

.208 

.190 

.174 

18 

.703 

.581 

.496 

.431 

.379 

.337 

.301 

.272 

.246 

.225 

.206 

.189 

19 

.717 

.598 

.515 

.450 

.398 

.355 

.320 

.289 

.263 

.241 

.221 

.204 

20 

.730 

.614 

.532 

.468 

.416 

.373 

.337 

.306 

.279 

.256 

.236 

.218 

21 

.741 

.629 

.548 

.485 

.433 

.390 

.354 

.322 

.295 

.271 

.251 

.232 

22 

.752 

.643 

.564 

.501 

.449 

.406 

.370 

.338 

.310 

.286 

.265 

.246 

23 

.762 

.656 

.578 

.516 

.465 

.422 

.385 

.353 

.325 

.300 

.279 

.259 

24 

.771 

.668 

.591 

.530 

.479 

.436 

.399 

.367 

.339 

.314 

.292 

.272 

25 

.779 

.679 

.604 

.544 

.493 

.450 

.413 

.381 

.353 

.328 

.305 

.285 

26 

.787 

.689 

.616 

.556 

.506 

.464 

.427 

.395 

.366 

.341 

.318 

.297 

27 

.794 

.699 

.627 

.568 

.519 

.477 

.440 

.407 

.379 

.353 

.330 

.309 

28 

.801 

.708 

.638 

.580 

.531 

.489 

.452 

.420 

.391 

.365 

.342 

.321 

29 

.807 

.717 

.648 

.591 

.542 

.501 

.464 

.432 

.403 

.377 

.354 

.332 

30 

.813 

.725 

.657 

.601 

.553 

.512 

.475 

.443 

.414 

.388 

.365 

.344 

40 

.858 

.786 

.730 

.682 

.640 

.602 

.568 

.537 

.509 

.484 

.460 

.439 

60 

.903 

.853 

.811 

.774 

.741 

.710 

.682 

.656 

.632 

.609 

.588 

.568 

80 

.927 

.888 

.854 

.825 

.798 

.772 

.749 

.727 

.706 

.686 

.667 

.649 

100 

.941 

.909 

.882 

.857 

.834 

.813 

.793 

.774 

.755 

.738 

.721 

.705 

120 

.951 

.924 

.900 

.879 

.860 

.841 

.823 

.807 

.791 

.775 

.760 

.746 

140 

.958 

.934 

.914 

.895 

.878 

.862 

.846 

.831 

.817 

.803 

.790 

.777 

170 

.965 

.946 

.929 

.913 

.898 

.885 

.871 

.859 

.846 

.834 

.823 

.812 

200 

.970 

.954 

.939 

.926 

.913 

.901 

.889 

.878 

.867 

.857 

.847 

.837 

240 

.975 

.961 

.949 

.938 

.927 

.917 

.907 

.897 

.888 

.879 

.870 

.862 

320 

.981 

.971 

.962 

.953 

.945 

.937 

.929 

.922 

.914 

.907 

.901 

.894 

440 

.986 

.979 

.972 

.965 

.959 

.953 

.948 

.942 

.937 

.932 

.926 

.921 

600 

.990 

.984 

.979 

.975 

.970 

.966 

.961 

.957 

.953 

.949 

.945 

.942 

800 

.993 

.988 

.984 

.981 

.977 

.974 

.971 

.968 

.965 

.962 

.959 

.956 

1000 

.994 

.991 

.987 

.985 

.982 

.979 

.977 

.974 

.972 

.969 

.967 

.964

ᵃ Multiply entry by $10^{-3}$.
Table  A.9.  (Continued)


$\nu_H$


$\nu_E$

1 

2 

3 

4 

5 

6 

7 

8 

9 

10 

11 

12 

p = 3

1

.000

.000

.000

.000

.000

.000

.000 

.000 

.000 

.000 

.000 

.000 

2 

.000  .000 

.000 

.000 

.000 

.001" 

.002" 

.004" 

.005" 

.008" 

.010" 

.013' 

3 

1.70" 

.354" 

.179" 

.127" 

.105" 

.095" 

.091" 

.090" 

.091" 

.092" 

.095" 

.098' 

4 

.034  .010 

.004 

.002 

.001 

.001 

.809" 

.659" 

.562" 

.496" 

.449" 

.416' 

5 

.097  .036 

.018 

.010 

6.36" 

4.37" 

3.20" 

2.46" 

1.97" 

1.64" 

1.40" 

1.22" 

6 

.168  .074 

.040 

.024 

.016 

.011 

.008 

.006 

.004 

3.94" 

3.28" 

2.79" 

7 

.236 

.116 

.068 

.043 

.029 

.021 

.016 

.012 

9.49" 

7.67" 

6.35" 

5.35" 

8 

.296 

.160 

.099 

.066 

.046 

.034 

.026 

.020 

.016 

.013 

.011 

9.00" 

9 

.349  .203 

.131 

.091 

.066 

.049 

.038 

.030 

.024 

.020 

.016 

.014 

10 

.396  .243 

.164 

.117 

.086 

.066 

.052 

.041 

.034 

.028 

.023 

.020 

11 

.437  .281 

.196 

.143 

.108 

.084 

.067 

.054 

.044 

.037 

.031 

.026 

12 

.473  .316 

.226 

.169 

.130 

.103 

.083 

.067 

.056 

.047 

.040 

.034 

13 

.505  .348 

.255 

.194 

.152 

.122 

.099 

.082 

.068 

.058 

.049 

.042 

14 

.534  .378 

.283 

.219 

.174 

.141 

.116 

.096 

.081 

.069 

.059 

.051 

15 

.560  .405 

.309 

.243 

.195 

.160 

.133 

.111 

.095 

.081 

.070 

.061 

16 

.583  .431 

.334 

.266 

.216 

.179 

.149 

.127 

.108 

.093 

.081 

.071 

17 

.603  .454 

.357 

.288 

.236 

.197 

.166 

.142 

.122 

.106 

.092 

.081 

18 

.622  .476 

.379 

.309 

.256 

.215 

.183 

.157 

.136 

.118 

.104 

.092 

19 

.639  .496 

.399 

.329 

.275 

.233 

.199 

.172 

.149 

.131 

.115 

.102 

20 

.655  .515 

.419 

.348 

.293 

.250 

.215 

.187 

.163 

.144 

.127 

.113 

21 

.669  .532 

.437 

.366 

.310 

.266 

.230 

.201 

.177 

.156 

.139 

.124 

22 

.683  .548 

.454 

.383 

.327 

.282 

.246 

.215 

.190 

.169 

.150 

.135 

23 

.695  .564 

.470 

.399 

.343 

.298 

.260 

.229 

.203 

.181 

.162 

.146 

24 

.706  .578 

.486 

.415 

.359 

.313 

.275 

.243 

.216 

.193 

.173 

.156 

25 

.717  .591 

.500 

.430 

.374 

.327 

.289 

.256 

.229 

.205 

.185 

.167 

26 

.727  .604 

.514 

.444 

.388 

.341 

.302 

.269 

.241 

.217 

.196 

.178 

27 

.736  .616 

.527 

.458 

.401 

.355 

.315 

.282 

.253 

.229 

.207 

.188 

28 

.744  .627 

.540 

.471 

.415 

.368 

.328 

.294 

.265 

.240 

.218 

.199 

29 

.752  .638 

.552 

.483 

.427 

.380 

.340 

.306 

.277 

.251 

.229 

.209 

30 

.760  .648 

.563 

.495 

.439 

.392 

.352 

.318 

.288 

.262 

.239 

.219 

40 

.816  .724 

.651 

.591 

.539 

.494 

.454 

.419 

.387 

.359 

.334 

.311 

60 

.875 

.808 

.752 

.704 

.661 

.623 

.587 

.555 

.526 

.498 

.473 

.449 

80 

.905 

.853 

.808 

.769 

.733 

.700 

.670 

.641 

.615 

.590 

.566 

.544 

100 

.924  .881 

.844 

.810 

.780 

.751 

.725 

.700 

.676 

.654 

.632 

.612 

120 

.936  .900 

.868 

.839 

.813 

.788 

.764 

.742 

.721 

.700 

.681 

.663 

140 

.945  .913 

.886 

.861 

.837 

.815 

.794 

.774 

.755 

.736 

.719 

.702 

170 

.955  .928 

.905 

.884 

.864 

.845 

.827 

.809 

.792 

.776 

.761 

.746 

200 

.961 

.939 

.919 

.900 

.883 

.866 

.850 

.835 

.820 

.806 

.792 

.779 

240 

.968  .949 

.932 

.916 

.901 

.887 

.873 

.860 

.848 

.835 

.823 

.811 

320 

.976  .961 

.948 

.936 

.925 

.914 

.903 

.893 

.883 

.873 

.864 

.854 

440 

.982  .972 

.962 

.953 

.945 

.937 

.929 

.921 

.913 

.906 

.899 

.891 

600 

.987  .979 

.972 

.966 

.959 

.953 

.947 

.941 

.936 

.930 

.924 

.919 

800 

.990  .984 

.979 

.974 

.969 

.965 

.960 

.956 

.951 

.947 

.943 

.939 

1000 

.992  .987 

.983 

.979 

.975 

.972 

.968 

.964 

.961 

.957 

.954 

.950

ᵃ Multiply entry by $10^{-3}$.
Table  A.9.  (Continued)


$\nu_H$


$\nu_E$

1

2 

3 

4 

5 

6 

7 

8 

9 

10 

11 

12 

p = 4

1  .000

.000

.000

.000

.000

.000

.000 

.000 

.000 

.000 

.000 

.000 

2  .000

.000 

.000 

.000 

.000 

.000 

.000 

.000 

.000 

.000 

.000 

.000 

3  .000 

.000 

.000 

.000 

.000 

.001" 

.001" 

.001" 

.002" 

.002" 

.002" 

.003' 

4  1.38° 

.292" 

.127" 

.075" 

.052" 

.040ᵃ

.033" 

.029" 

.026" 

.025" 

.023" 

.022' 

5  .026 

6.09" 

2.31" 

1.13" 

.647" 

.416" 

.292" 

.218" 

.172" 

.141" 

.120" 

.105' 

6  .076

.024 

.010 

5.07" 

2.90" 

1.82" 

1.22" 

.872" 

.652" 

.508" 

.409" 

.338' 

7  .135

.051 

.024 

.013 

7.74" 

4.94" 

3.34" 

2.36" 

1.74" 

1.33" 

1.05" 

.848' 

8  .194

.084 

.043 

.025 

.015 

.010 

6.98" 

4.99" 

3.70" 

2.82" 

2.21" 

1.77" 

9  .249 

.119 

.066 

.040 

.026 

.017 

.012 

8.91" 

6.66" 

5.11" 

4.01" 

3.21" 

10  .298 

.155 

.091 

.057 

.038 

.027 

.019 

.014 

.011 

8.29" 

6.54" 

5.25" 

11  .343

.190 

.117 

.077 

.053 

.037 

.027 

.021 

.016 

.012 

9.84" 

7.95" 

12  .382 

.223 

.143 

.097 

.068 

.049 

.037 

.028 

.022 

.017 

.014 

.011 

13  .418 

.255 

.169 

.117 

.085 

.063 

.047 

.037 

.029 

.023 

.019 

.015 

14  .450 

.286 

.194 

.138 

.102 

.077 

.059 

.046 

.037 

.030 

.024 

.020 

15  .479 

.314 

.219 

.159 

.119 

.091 

.071 

.056 

.045 

.037 

.030 

.025 

16  .506 

.340 

.243 

.180 

.136 

.106 

.083 

.067 

.054 

.044 

.037 

.031 

17  .529 

.365 

.266 

.200 

.154 

.121 

.096 

.078 

.064 

.053 

.044 

.037 

18  .551

.389 

.288 

.219 

.171 

.136 

.109 

.089 

.074 

.061 

.051 

.044 

19  .571 

.410 

.309 

.239 

.188 

.151 

.123 

.101 

.084 

.070 

.059 

.051 

20  .589 

.431 

.329 

.257 

.205 

.166 

.136 

.113 

.094 

.079 

.068 

.058 

21  .606 

.450 

.348 

.275 

.221 

.181 

.149 

.124 

.105 

.089 

.076 

.065 

22  .621 

.468 

.366 

.292 

.237 

.195 

.162 

.136 

.115 

.098 

.085 

.073 

23  .636 

.485 

.383 

.309 

.253 

.210 

.175 

.148 

.126 

.108 

.093 

.081 

24  .649 

.501 

.399 

.325 

.268 

.224 

.188 

.160 

.137 

.118 

.102 

.089 

25  .661 

.516 

.415 

.340 

.283 

.237 

.201 

.172 

.148 

.128 

.111 

.097 

26  .673 

.530 

.430 

.355 

.297 

.251 

.214 

.183 

.158 

.138 

.120 

.106 

27  .684 

.544 

.444 

.369 

.311 

.264 

.226 

.195 

.169 

.147 

.129 

.114 

28  .694 

.556 

.458 

.383 

.324 

.277 

.238 

.206 

.180 

.157 

.138 

.122 

29  .703 

.568 

.471 

.396 

.337 

.289 

.250 

.217 

.190 

.167 

.147 

.131 

30  .712 

.580 

.483 

.409 

.349 

.301 

.261 

.228 

.200 

.177 

.157 

.139 

40  .779 

.668 

.583 

.513 

.455 

.406 

.364 

.327 

.295 

.267 

.243 

.221 

60  .849 

.767 

.700 

.643 

.592 

.547 

.507 

.471 

.438 

.409 

.382 

.357 

80  .885 

.821 

.766 

.718 

.675 

.636 

.600 

.567 

.536 

.508 

.482 

.457 

100  .908 

.854 

.809 

.768 

.730 

.696 

.664 

.634 

.606 

.580 

.555 

.532 

120  .923 

.877 

.838 

.802 

.770 

.739 

.711 

.684 

.658 

.634 

.611 

.590 

140  .934 

.894 

.860 

.828 

.799 

.772 

.746 

.721 

.698 

.676 

.655 

.635 

170  .945 

.912 

.883 

.856 

.831 

.808 

.785 

.764 

.743 

.724 

.705 

.687 

200  .953 

.925 

.900 

.876 

.855 

.834 

.814 

.795 

.777 

.759 

.742 

.726 

240  .961 

.937 

.916 

.896 

.877 

.859 

.842 

.826 

.810 

.795 

.780 

.765 

320  .971 

.952 

.936 

.921 

.907 

.893 

.879 

.866 

.854 

.841 

.829 

.818 

440  .979 

.965 

.953 

.942 

.931 

.921 

.911 

.901 

.891 

.882 

.872 

.863 

600  .984 

.974 

.966 

.957 

.949 

.941 

.934 

.926 

.919 

.912 

.905 

.898 

800  .988 

.981 

.974 

.968 

.961 

.956 

.950 

.944 

.938 

.933 

.927 

.922 

1000  .991 

.985 

.979 

.974 

.969 

.964 

.960 

.955 

.950 

.946 

.941 

.937

ᵃ Multiply entry by $10^{-3}$.
Table  A.9.  (Continued)


$\nu_H$


$\nu_E$

1 

2 

3 

4 

5 

6 

7 

8 

9 

10 

11 

12 

p = 5

1

.000

.000

.000

.000

.000

.000

.000 

.000 

.000 

.000 

.000 

.000 

2 

.000 

.000 

.000 

.000 

.000 

.000 

.000 

.000 

.000 

.000 

.000 

.000 

3 

.000 

.000 

.000 

.000 

.000 

.000 

.000 

.000 

.000 

.000 

.000 

.000 

4 

.000 

.000 

.000 

.000 

.001“ 

.001“ 

.001“ 

.001“ 

.001“ 

.001“ 

.001“ 

.001' 

5 

1.60“ 

.291“ 

.105“ 

.052“ 

.031“ 

.021“ 

.015“ 

.012“ 

.010“ 

.008“ 

.007“ 

.007' 

6 

.021 

4.39“ 

1.48“ 

.647“ 

.335“ 

.197“ 

.126“ 

.087“ 

.064“ 

.049“ 

.039“ 

.032' 

7 

.063 

.017 

6.36“ 

2.90“ 

1.51“ 

.872“ 

.544“ 

.361“ 

.253“ 

.185“ 

.141“ 

.110' 

8 

.114 

.037 

.016 

7.74“ 

4.21“ 

2.48“ 

1.56“ 

1.03“ 

.716“ 

.516“ 

.385“ 

.296' 

9 

.165 

.063 

.029 

.015 

8.79“ 

5.35“ 

3.43“ 

2.30“ 

1.61“ 

1.16“ 

.861“ 

.657' 

10 

.215 

.092 

.046 

.026 

.015 

9.64“ 

6.34“ 

4.34“ 

3.06“ 

2.22“ 

1.66“ 

1.27“ 

11 

.261 

.122 

.066 

.038 

.024 

.015 

.010 

7.22“ 

5.17“ 

3.80“ 

2.86“ 

2.19“ 

12 

.303 

.153 

.086 

.053 

.034 

.022 

.015 

.011 

7.99“ 

5.95“ 

4.51“ 

3.49“ 

13 

.341 

.183 

.108 

.068 

.045 

.031 

.022 

.016 

.012 

8.68“ 

6.66“ 

5.19“ 

14 

.376 

.212 

.130 

.085 

.057 

.040 

.029 

.021 

.016 

.012 

9.31“ 

7.32“ 

15 

.407 

.239 

.152 

.102 

.070 

.050 

.037 

.027 

.021 

.016 

.012 

9.88“ 

16 

.436 

.266 

.174 

.119 

.084 

.061 

.045 

.034 

.026 

.020 

.016 

.013 

17 

.462 

.291 

.195 

.136 

.098 

.072 

.054 

.042 

.032 

.025 

.020 

.016 

18 

.486 

.315 

.216 

.154 

.113 

.084 

.064 

.050 

.039 

.031 

.025 

.020 

19 

.508 

.337 

.236 

.171 

.127 

.096 

.074 

.058 

.046 

.037 

.030 

.024 

20 

.529 

.359 

.256 

.188 

.142 

.109 

.085 

.067 

.053 

.043 

.035 

.029 

21 

.548 

.379 

.275 

.205 

.156 

.121 

.095 

.076 

.061 

.050 

.041 

.034 

22 

.565 

.398 

.293 

.221 

.171 

.134 

.106 

.085 

.069 

.057 

.047 

.039 

23 

.581 

.416 

.310 

.237 

.185 

.146 

.117 

.095 

.077 

.064 

.053 

.044 

24 

.596 

.433 

.327 

.253 

.199 

.159 

.128 

.104 

.086 

.071 

.060 

.050 

25 

.610 

.449 

.343 

.268 

.213 

.171 

.139 

.114 

.094 

.079 

.066 

.056 

26 

.623 

.465 

.359 

.283 

.226 

.183 

.150 

.124 

.103 

.087 

.073 

.062 

27 

.635 

.479 

.374 

.297 

.239 

.195 

.161 

.134 

.112 

.094 

.080 

.068 

28 

.647 

.493 

.388 

.311 

.252 

.207 

.172 

.143 

.121 

.102 

.087 

.075 

29 

.658 

.506 

.401 

.324 

.265 

.219 

.182 

.153 

.130 

.110 

.094 

.081 

30 

.668 

.519 

.415 

.337 

.277 

.230 

.193 

.163 

.138 

.118 

.102 

.088 

40 

.744 

.617 

.522 

.446 

.384 

.333 

.291 

.255 

.224 

.198 

.176 

.156 

60 

.825 

.729 

.652 

.587 

.531 

.482 

.438 

.400 

.366 

.336 

.308 

.284 

80 

.867 

.791 

.727 

.672 

.623 

.578 

.538 

.502 

.469 

.438 

.410 

.385 

100 

.893 

.830 

.776 

.728 

.685 

.645 

.609 

.576 

.544 

.516 

.489 

.464 

120 

.910 

.856 

.810 

.768 

.730 

.694 

.661 

.631 

.602 

.575 

.549 

.525 

140 

.923 

.876 

.835 

.798 

.763 

.731 

.701 

.673 

.647 

.621 

.598 

.575 

170 

.936 

.897 

.862 

.830 

.801 

.773 

.747 

.722 

.698 

.675 

.654 

.633 

200 

.945 

.912 

.882 

.854 

.828 

.803 

.780 

.758 

.736 

.716 

.696 

.677 

240 

.954 

.926 

.900 

.877 

.855 

.833 

.813 

.793 

.775 

.757 

.739 

.722 

300 

.966 

.944 

.925 

.906 

.889 

.872 

.856 

.841 

.825 

.811 

.797 

.783 

440 

.975 

.959 

.945 

.931 

.918 

.905 

.893 

.881 

.870 

.858 

.847 

.836 

600 

.982 

.970 

.959 

.949 

.939 

.930 

.920 

.911 

.903 

.894 

.885 

.877 

800 

.986 

.977 

.969 

.961 

.954 

.947 

.940 

.933 

.926 

.919 

.913 

.906 

1000 

.989 

.982 

.975 

.969 

.963 

.957 

.951 

.946 

.940 

.935 

.929 

.924

ᵃ Multiply entry by $10^{-3}$.
Table  A.9.  (Continued)


$\nu_H$


$\nu_E$

1 

2 

3 

4 

5 

6 

7 

8 

9 

10 

11 

12 

p = 6

1

.000

.000

.000

.000

.000

.000

.000 

.000 

.000 

.000 

.000 

.000 

2 

.000 

.000 

.000 

.000 

.000 

.000 

.000 

.000 

.000 

.000 

.000 

.000 

3 

.000 

.000 

.000 

.000 

.000 

.000 

.000 

.000 

.000 

.000 

.000 

.000 

4 

.000 

.000 

.000 

.000 

.000 

.000 

.000 

.000 

.000 

.000 

.000 

.000 

5 

.007“ 

.002“ 

.001" 

.001" 

.001“ 

.000 

.000 

.000 

.000 

.000 

.000 

.000 

6 

2.04" 

.315“ 

.095" 

.040" 

.021“ 

.012" 

.008“ 

o 

o 

.004“ 

.003" 

.003“ 

.002' 

7 

.019 

3.48“ 

1.05" 

.416" 

.197“ 

.106" 

.063“ 

b 

p 

.027“ 

.020" 

.015“ 

.011' 

8 

.054 

.013 

4.37" 

1.82" 

.872“ 

.465" 

.270“ 

.168" 

.111“ 

.076" 

.055“ 

.041' 

9 

.098 

.029 

.011 

4.94" 

2.48“ 

1.36“ 

.798“ 

.497“ 

.325“ 

.222" 

.157“ 

.115' 

10 

.144 

.050 

.021 

.010 

5.35“ 

3.04“ 

1.83“ 

1.16“ 

.762" 

.521" 

.369" 

.269' 

11 

.189 

.074 

.034 

.017 

9.64“ 

5.67“ 

3.51“ 

2.26" 

1.51“ 

1.05" 

.744“ 

.543' 

12 

.232 

.099 

.049 

.027 

.015 

9.35“ 

5.94“ 

3.92“ 

2.66" 

1.86" 

1.34" 

.983' 

13 

.271 

.126 

.066 

.037 

.022 

.014 

9.17" 

6.17“ 

4.27“ 

3.03" 

2.20" 

1.63“ 

14 

.308 

.152 

.084 

.049 

.031 

.020 

.013 

9.07“ 

6.38“ 

4.59" 

3.37" 

2.52" 

15 

.341 

.179 

.103 

.063 

.040 

.026 

.018 

.013 

9.00" 

6.57" 

4.88" 

3.68" 

16 

.372 

.204 

.122 

.077 

.050 

.034 

.024 

.017 

.012 

8.97" 

6.74" 

5.14“ 

17    .400    .229    .141    .091    .061    .042    .030    .021    .016    .012    8.97^a  6.90^a
18    .426    .252    .160    .106    .072    .051    .037    .027    .020    .015    .012    8.97^a
19    .450    .275    .179    .121    .084    .060    .044    .033    .025    .019    .015    .011
20    .473    .296    .197    .136    .096    .070    .052    .039    .030    .023    .018    .014
21    .493    .317    .215    .151    .109    .080    .060    .045    .035    .027    .021    .017
22    .512    .337    .233    .166    .121    .090    .068    .052    .041    .032    .025    .020
23    .530    .355    .250    .181    .134    .101    .077    .060    .047    .037    .030    .024
24    .546    .373    .266    .195    .146    .111    .086    .067    .053    .042    .034    .028
25    .562    .390    .282    .210    .159    .122    .095    .075    .060    .048    .039    .032
26    .576    .406    .298    .224    .171    .133    .104    .083    .066    .054    .044    .036
27    .590    .422    .313    .237    .183    .143    .113    .091    .073    .060    .049    .040
28    .603    .436    .327    .251    .195    .154    .123    .099    .080    .066    .054    .045
29    .615    .450    .341    .264    .207    .165    .132    .107    .088    .072    .060    .050
30    .626    .464    .355    .277    .219    .175    .142    .116    .095    .079    .066    .055
40    .711    .570    .467    .387    .324    .273    .232    .198    .170    .147    .127    .110
60    .802    .693    .608    .536    .476    .424    .379    .340    .305    .275    .249    .225
80    .849    .762    .690    .629    .574    .526    .483    .445    .410    .378    .350    .324
100   .878    .806    .745    .691    .642    .599    .559    .523    .489    .458    .430    .404
120   .898    .836    .783    .735    .692    .652    .616    .582    .551    .521    .494    .468
140   .912    .858    .811    .769    .730    .694    .660    .629    .599    .572    .546    .521
170   .927    .882    .842    .806    .772    .740    .710    .682    .656    .630    .607    .584
200   .938    .899    .864    .832    .803    .774    .748    .722    .698    .675    .653    .632
240   .948    .915    .886    .858    .833    .808    .785    .763    .741    .721    .701    .682
320   .961    .936    .913    .892    .872    .852    .834    .816    .799    .782    .766    .750
440   .972    .953    .936    .920    .905    .890    .876    .862    .849    .836    .823    .811
600   .979    .965    .953    .941    .930    .918    .908    .897    .887    .877    .867    .857
800   .984    .974    .964    .955    .947    .938    .930    .922    .914    .906    .898    .891
1000  .987    .979    .971    .964    .957    .950    .944    .937    .930    .924    .918    .912

^a Multiply entry by $10^{-3}$.
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ν_H


ν_E   1       2       3       4       5       6       7       8       9       10      11      12

                                                p = 7
1     .000    .000    .000    .000    .000    .000    .000    .000    .000    .000    .000    .000
2     .000    .000    .000    .000    .000    .000    .000    .000    .000    .000    .000    .000
3     .000    .000    .000    .000    .000    .000    .000    .000    .000    .000    .000    .000
4     .000    .000    .000    .000    .000    .000    .000    .000    .000    .000    .000    .000
5     .000    .000    .000    .000    .000    .000    .000    .000    .000    .000    .000    .000
6     .043^a  .006^a  .002^a  .001^a  .001^a  .000    .000    .000    .000    .000    .000    .000
7     2.62^a  .350^a  .091^a  .033^a  .015^a  [illegible]  .005^a  .003^a  .002^a  .002^a  .001^a  .001^a
8     .018    2.95^a  .809^a  .292^a  .126^a  .063^a  .034^a  .020^a  .013^a  .009^a  .006^a  .005^a
9     .048    .010    3.20^a  1.22^a  .543^a  .270^a  .147^a  [illegible]  .053^a  .035^a  .024^a  .017^a
10    .087    .023    8.07^a  3.34^a  1.56^a  .798^a  .440^a  .259^a  .160^a  .104^a  .070^a  .049^a
11    .128    .040    .016    6.97^a  3.43^a  1.83^a  1.04^a  .619^a  .387^a  .252^a  .170^a  .119^a
12    .170    .060    .026    .012    6.34^a  3.51^a  2.05^a  1.25^a  .796^a  .525^a  .357^a  .249^a
13    .209    .083    .038    .019    .010    5.94^a  3.57^a  2.23^a  1.45^a  .967^a  .665^a  .468^a
14    .246    .106    .052    .027    .015    9.17^a  5.67^a  3.63^a  2.40^a  1.62^a  1.13^a  .804^a
15    .281    .129    .067    .037    .022    .013    8.37^a  5.48^a  3.68^a  2.54^a  1.79^a  1.28^a
16    .313    .153    .083    .047    .029    .018    .012    7.80^a  5.34^a  3.73^a  2.66^a  1.94^a
17    .343    .176    .099    .059    .037    .024    .016    .011    7.38^a  5.24^a  3.78^a  2.78^a
18    .370    .199    .116    .071    .045    .030    .020    .014    9.81^a  7.06^a  5.16^a  3.83^a
19    .396    .221    .133    .083    .054    .037    .025    .018    .013    9.20^a  6.80^a  5.10^a
20    .420    .242    .149    .096    .064    .044    .031    .022    .016    .012    8.72^a  6.60^a
21    .442    .263    .166    .109    .074    .052    .037    .026    .019    .014    .011    8.34^a
22    .462    .283    .183    .123    .085    .060    .043    .031    .023    .018    .013    .010
23    .482    .301    .199    .136    .095    .068    .050    .037    .028    .021    .016    .013
24    .499    .320    .215    .149    .106    .077    .057    .042    .032    .025    .019    .015
25    .516    .337    .230    .162    .117    .086    .064    .048    .037    .029    .022    .018
26    .532    .354    .246    .175    .128    .095    .071    .055    .042    .033    .026    .020
27    .547    .370    .260    .188    .139    .104    .079    .061    .047    .037    .029    .024
28    .561    .385    .275    .201    .150    .113    .087    .068    .053    .042    .033    .027
29    .574    .399    .289    .214    .161    .123    .095    .074    .059    .047    .037    .030
30    .586    .413    .302    .226    .172    .132    .103    .081    .064    .052    .042    .034
40    .679    .526    .417    .335    .273    .224    .185    .154    .128    .108    .091    .077
60    .779    .660    .566    .490    .426    .373    .327    .288    .254    .225    .200    .178
80    .832    .735    .656    .588    .530    .479    .434    .394    .358    .326    .298    .272
100   .864    .783    .715    .656    .603    .556    .513    .475    .439    .408    .378    .352
120   .886    .817    .757    .704    .657    .613    .574    .537    .504    .473    .444    .418
140   .902    .841    .788    .741    .698    .658    .621    .587    .556    .526    .498    .472
170   .919    .868    .823    .782    .744    .709    .676    .645    .616    .589    .563    .539
200   .931    .887    .848    .812    .778    .747    .717    .689    .662    .637    .613    .590
240   .942    .905    .871    .841    .812    .784    .758    .733    .709    .687    .665    .644
320   .957    .928    .902    .878    .855    .833    .812    .792    .773    .754    .736    .719
440   .968    .947    .928    .910    .893    .876    .860    .844    .829    .814    .800    .786
600   .977    .961    .947    .933    .920    .908    .895    .883    .872    .860    .849    .838
800   .982    .971    .960    .950    .940    .930    .920    .911    .902    .893    .884    .876
1000  .986    .977    .968    .959    .951    .943    .936    .928    .921    .914    .906    .899

^a Multiply entry by $10^{-3}$.
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Table  A.9.  ( Continued ) 


                                                ν_H
ν_E   1       2       3       4       5       6       7       8       9       10      11      12

                                                p = 8
1     .000    .000    .000    .000    .000    .000    .000    .000    .000    .000    .000    .000
2     .000    .000    .000    .000    .000    .000    .000    .000    .000    .000    .000    .000
3     .000    .000    .000    .000    .000    .000    .000    .000    .000    .000    .000    .000
4     .000    .000    .000    .000    .000    .000    .000    .000    .000    .000    .000    .000
5     .000    .000    .000    .000    .000    .000    .000    .000    .000    .000    .000    .000
6     .000    .000    .000    .000    .000    .000    .000    .000    .000    .000    .000    .000
7     .138^a  .015^a  .004^a  .001^a  .001^a  .000    .000    .000    .000    .000    .000    .000
8     3.30^a  .393^a  .090^a  .029^a  .012^a  .006^a  .003^a  .002^a  .001^a  .001^a  .001^a  .000
9     .017    2.63^a  .659^a  .218^a  .087^a  .040^a  .020^a  .011^a  .007^a  .004^a  .003^a  .002^a
10    .044    8.63^a  2.46^a  .872^a  .361^a  .168^a  .086^a  .047^a  .028^a  .017^a  .011^a  .008^a
11    .078    .019    6.15^a  2.36^a  1.03^a  .497^a  .259^a  .144^a  .085^a  .052^a  .034^a  .023^a
12    .116    .033    .012    4.99^a  2.30^a  1.16^a  .619^a  .351^a  .209^a  .130^a  .084^a  .056^a
13    .154    .051    .020    8.91^a  4.34^a  2.26^a  1.25^a  .727^a  .441^a  .278^a  .181^a  .122^a
14    .190    .070    .030    .014    7.22^a  3.92^a  2.23^a  1.33^a  .824^a  .527^a  .347^a  .235^a
15    .225    .090    .041    .021    .011    6.17^a  3.63^a  2.22^a  1.40^a  .910^a  .608^a  .416^a
16    .258    .111    .054    .028    .016    9.06^a  5.48^a  3.42^a  2.20^a  1.46^a  .987^a  .683^a
17    .289    .133    .067    .037    .021    .013    7.80^a  4.98^a  3.27^a  2.20^a  1.51^a  1.06^a
18    .318    .154    .082    .046    .027    .017    .011    6.92^a  4.62^a  3.15^a  2.19^a  1.56^a
19    .345    .175    .096    .056    .034    .021    .014    9.23^a  6.26^a  4.34^a  3.06^a  2.19^a
20    .370    .195    .111    .067    .042    .027    .018    .012    8.22^a  5.77^a  4.12^a  2.99^a
21    .393    .215    .127    .078    .050    .033    .022    .015    .010    7.46^a  5.39^a  3.95^a
22    .415    .235    .142    .089    .058    .039    .026    .018    .013    9.40^a  6.86^a  5.08^a
23    .436    .254    .157    .101    .067    .045    .031    .022    .016    .012    8.56^a  6.39^a
24    .455    .272    .172    .113    .076    .052    .037    .026    .019    .014    .010    7.88^a
25    .473    .289    .187    .124    .085    .060    .042    .031    .023    .017    .013    9.56^a
26    .490    .306    .201    .136    .095    .067    .048    .035    .026    .020    .015    .011
27    .505    .322    .215    .148    .104    .075    .055    .040    .030    .023    .017    .013
28    .520    .338    .229    .160    .114    .083    .061    .045    .034    .026    .020    .016
29    .534    .353    .243    .172    .124    .091    .068    .051    .039    .030    .023    .018
30    .548    .367    .256    .183    .134    .099    .074    .056    .043    .034    .026    .021
40    .649    .485    .372    .290    .229    .182    .146    .118    .096    .079    .065    .054
60    .758    .627    .527    .447    .381    .327    .282    .244    .212    .184    .161    .141
80    .815    .709    .623    .551    .489    .435    .389    .348    .313    .281    .253    .229
100   .851    .761    .687    .622    .566    .516    .471    .431    .395    .362    .333    .306
120   .875    .798    .732    .675    .623    .577    .535    .496    .461    .429    .399    .372
140   .892    .825    .767    .715    .667    .625    .585    .549    .515    .484    .455    .428
170   .911    .854    .804    .759    .717    .679    .644    .610    .579    .550    .523    .497
200   .924    .875    .831    .791    .755    .720    .688    .657    .629    .602    .576    .551
240   .936    .895    .858    .823    .791    .761    .732    .705    .679    .655    .631    .609
320   .952    .920    .891    .865    .839    .815    .792    .770    .748    .728    .708    .689
440   .965    .942    .920    .900    .880    .862    .844    .827    .810    .794    .778    .762
600   .974    .957    .941    .926    .911    .897    .883    .870    .857    .844    .831    .819
800   .981    .968    .955    .944    .933    .922    .911    .901    .890    .880    .871    .861
1000  .985    .974    .964    .955    .946    .937    .928    .920    .911    .903    .895    .887

^a Multiply entry by $10^{-3}$.
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Table A.10. Upper Critical Values for Roy’s Test, α = .05


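Roy’s test statistic is given by

$$\theta = \frac{\lambda_1}{1 + \lambda_1},$$

where $\lambda_1$ is the largest eigenvalue of $\mathbf{E}^{-1}\mathbf{H}$. The parameters are

$$s = \min(\nu_H, p), \qquad m = \frac{|\nu_H - p| - 1}{2}, \qquad N = \frac{\nu_E - p - 1}{2}.$$

Reject $H_0$ if $\theta >$ table value.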
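The lookup described above can be sketched in a few lines of Python. This is only an illustration: the function names and the example numbers (p = 3, ν_H = 2, ν_E = 24, eigenvalues 1.5 and 0.2) are hypothetical, and N is taken as (ν_E - p - 1)/2, the usual definition that accompanies this table.

```python
def roy_parameters(nu_h, nu_e, p):
    """Return (s, m, N), the parameters used to enter Table A.10.

    N = (nu_E - p - 1) / 2 is assumed here (the standard definition).
    """
    s = min(nu_h, p)
    m = (abs(nu_h - p) - 1) / 2
    n = (nu_e - p - 1) / 2
    return s, m, n


def roy_theta(eigenvalues):
    """Roy's statistic theta = lambda_1 / (1 + lambda_1),
    where lambda_1 is the largest eigenvalue of E^{-1} H."""
    lam1 = max(eigenvalues)
    return lam1 / (1 + lam1)


# Hypothetical example (made-up eigenvalues, for illustration only).
s, m, n = roy_parameters(nu_h=2, nu_e=24, p=3)
theta = roy_theta([1.5, 0.2])
critical = 0.374  # Table A.10 entry for s = 2, m = 0, N = 10

print(s, m, n)                             # 2 0.0 10.0
print(round(theta, 3), theta > critical)   # 0.6 True
```

With these made-up numbers θ = .6 exceeds the tabled .374, so H0 would be rejected at α = .05.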

                               m
N     0      1      2      3      4      5      7      10     15

                             s = 2
5     .565   .651   .706   .746   .776   .799   .834   .868   .901
10    .374   .455   .514   .561   .598   .629   .679   .732   .789
15    .278   .348   .402   .446   .483   .515   .567   .627   .696
20    .221   .281   .329   .369   .404   .434   .486   .546   .620
25    .184   .236   .278   .314   .346   .375   .424   .484   .558
30    .157   .203   .241   .274   .303   .330   .376   .433   .507
40    .122   .159   .190   .218   .243   .266   .306   .359   .428
50    .099   .130   .157   .180   .202   .222   .259   .306   .370
60    .084   .110   .133   .154   .173   .191   .223   .266   .326
80    .064   .085   .103   .119   .135   .149   .176   .211   .263
120   .043   .058   .070   .082   .093   .104   .123   .150   .190
240   .022   .030   .036   .042   .048   .054   .065   .080   .103

                             s = 3
5     .669   .729   .770   .800   .822   .840   .867   .894   .920
10    .472   .537   .586   .625   .656   .683   .725   .770   .819
15    .362   .422   .469   .508   .541   .569   .616   .669   .730
20    .293   .346   .390   .427   .458   .486   .533   .589   .656
25    .246   .294   .333   .367   .397   .424   .470   .525   .594
30    .212   .255   .291   .322   .350   .375   .419   .473   .543
40    .166   .201   .232   .259   .283   .305   .345   .395   .462
50    .136   .167   .192   .216   .237   .257   .292   .339   .402
60    .116   .142   .164   .185   .204   .221   .254   .296   .355
80    .089   .109   .127   .144   .160   .174   .201   .237   .288
120   .061   .075   .088   .100   .111   .122   .142   .169   .209
240   .031   .039   .046   .052   .058   .064   .075   .090   .114

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Table  A.10.  ( Continued ) 


                               m
N     0      1      2      3      4      5      7      10     15

                             s = 4
5     .739   .782   .813   .836   .854   .868   .889   .911   .933
10    .547   .601   .641   .674   .700   .723   .759   .798   .840
15    .431   .482   .523   .558   .587   .612   .654   .701   .756
20    .354   .402   .441   .474   .503   .529   .572   .623   .684
25    .301   .344   .380   .412   .440   .464   .507   .559   .624
30    .261   .301   .334   .364   .390   .414   .455   .507   .572
40    .207   .240   .269   .294   .318   .339   .377   .426   .490
50    .171   .199   .224   .247   .268   .287   .322   .367   .428
60    .145   .170   .193   .213   .232   .249   .280   .322   .380
80    .112   .132   .150   .167   .182   .196   .223   .259   .309
120   .077   .091   .104   .116   .127   .138   .158   .185   .226
240   .040   .047   .054   .061   .067   .073   .084   .100   .124

                             s = 5
5     .788   .821   .845   .863   .877   .888   .906   .924   .942
10    .607   .651   .685   .713   .735   .755   .786   .820   .857
15    .488   .533   .569   .599   .625   .648   .685   .728   .777
20    .407   .449   .485   .515   .542   .565   .604   .651   .708
25    .349   .388   .422   .451   .477   .500   .540   .588   .648
30    .305   .341   .373   .400   .425   .448   .487   .535   .597
40    .243   .275   .302   .327   .349   .370   .406   .453   .514
50    .202   .230   .254   .276   .296   .315   .348   .392   .451
60    .173   .197   .219   .238   .257   .274   .304   .345   .401
80    .134   .154   .171   .188   .203   .217   .243   .278   .329
120   .093   .107   .120   .132   .143   .154   .174   .201   .241
240   .048   .056   .063   .069   .076   .082   .093   .109   .134



Table  A.10.  ( Continued ) 


                               m
N     0      1      2      3      4      5      7      10     15

                             s = 6
5     .825   .850   .869   .883   .895   .904   .918   .934   .949
10    .655   .692   .721   .744   .764   .781   .808   .838   .871
15    .537   .576   .608   .635   .658   .678   .711   .750   .795
20    .454   .491   .523   .551   .575   .596   .632   .676   .728
25    .392   .428   .458   .485   .509   .531   .568   .613   .669
30    .345   .378   .407   .433   .457   .478   .514   .560   .618
40    .278   .307   .333   .356   .378   .397   .432   .477   .536
50    .232   .258   .281   .302   .322   .340   .372   .414   .472
60    .200   .223   .243   .262   .280   .297   .327   .366   .421
80    .156   .174   .192   .208   .222   .236   .262   .297   .346
120   .108   .122   .134   .146   .157   .168   .188   .215   .255
240   .056   .064   .071   .078   .084   .090   .101   .118   .142

                             s = 7
5     .852   .872   .887   .899   .908   .917   .929   .941   .955
10    .695   .726   .750   .771   .788   .802   .826   .853   .882
15    .579   .613   .641   .665   .686   .704   .734   .769   .810
20    .494   .528   .557   .582   .604   .624   .657   .697   .745
25    .431   .463   .491   .516   .538   .558   .593   .635   .688
30    .381   .412   .439   .463   .485   .505   .540   .583   .638
40    .309   .337   .362   .384   .404   .423   .456   .499   .555
60    .224   .246   .266   .285   .302   .318   .347   .386   .439
80    .176   .194   .211   .226   .241   .255   .280   .314   .363
100   .145   .160   .175   .188   .200   .212   .235   .265   .310
200   .077   .085   .093   .101   .109   .116   .129   .148   .175
300   .052   .058   .064   .069   .074   .079   .089   .103   .125
500   .032   .036   .039   .042   .046   .049   .055   .064   .078
1000  .016   .018   .020   .022   .023   .025   .028   .033   .041
.041 
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Table  A.10.  ( Continued ) 


                               m
N     0      1      2      3      4      5      7      10     15

                             s = 8
5     .874   .890   .902   .912   .920   .927   .937   .948   .959
10    .728   .754   .775   .793   .808   .821   .842   .865   .892
15    .615   .645   .670   .692   .710   .727   .754   .786   .824
20    .531   .561   .587   .610   .630   .648   .679   .716   .761
25    .466   .495   .521   .544   .565   .583   .616   .655   .705
30    .414   .443   .468   .491   .511   .530   .563   .603   .655
40    .339   .365   .388   .409   .428   .446   .478   .519   .573
60    .248   .269   .288   .306   .323   .338   .367   .404   .456
80    .195   .213   .229   .244   .259   .272   .297   .330   .378
100   .161   .176   .190   .203   .216   .228   .250   .279   .323
200   .086   .094   .103   .110   .118   .125   .138   .157   .185
300   .058   .065   .070   .076   .081   .086   .096   .109   .130
500   .036   .040   .043   .047   .050   .053   .059   .068   .081
1000  .018   .020   .022   .024   .025   .027   .030   .035   .042

                             s = 9
5     .891   .904   .914   .922   .929   .935   .944   .953   .963
10    .756   .778   .797   .812   .825   .837   .855   .876   .901
15    .647   .674   .696   .715   .732   .747   .771   .801   .835
20    .563   .591   .614   .635   .654   .670   .698   .733   .775
25    .497   .525   .549   .570   .589   .606   .636   .673   .720
30    .445   .471   .495   .516   .535   .552   .583   .622   .671
40    .366   .391   .413   .433   .451   .468   .499   .538   .590
60    .270   .291   .309   .326   .343   .358   .385   .421   .472
80    .214   .231   .247   .262   .276   .289   .313   .346   .392
100   .177   .192   .206   .219   .231   .242   .264   .293   .336
200   .095   .104   .112   .119   .127   .134   .147   .166   .194
300   .065   .071   .077   .082   .087   .092   .102   .115   .136
500   .040   .043   .047   .051   .054   .057   .063   .072   .086
1000  .020   .022   .024   .026   .028   .029   .032   .037   .044

                             s = 10
5     .905   .916   .924   .931   .937   .941   .949   .958   .967
10    .780   .799   .815   .829   .840   .851   .867   .886   .908
15    .675   .699   .719   .736   .751   .764   .787   .814   .846
20    .592   .617   .639   .658   .675   .690   .716   .748   .787
25    .526   .551   .573   .593   .611   .627   .655   .690   .734
30    .473   .497   .519   .539   .557   .573   .603   .639   .686
40    .392   .415   .436   .455   .473   .489   .518   .555   .605
60    .292   .311   .329   .346   .361   .376   .402   .438   .487
80    .232   .249   .264   .278   .292   .305   .329   .361   .406
100   .193   .207   .220   .233   .245   .256   .278   .306   .348
200   .104   .112   .120   .128   .135   .142   .156   .174   .202
300   .071   .077   .083   .088   .093   .098   .108   .122   .143
500   .044   .047   .051   .054   .058   .061   .067   .076   .090
1000  .022   .024   .026   .028   .030   .031   .034   .039   .047
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Table A.11. Upper Critical Values of Pillai’s Statistic $V^{(s)}$, α = .05


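$$V^{(s)} = \sum_{i=1}^{s} \frac{\lambda_i}{1 + \lambda_i},$$

where $\lambda_1, \lambda_2, \ldots, \lambda_s$ are eigenvalues of $\mathbf{E}^{-1}\mathbf{H}$. Reject $H_0$ if $V^{(s)}$ exceeds table value. The parameters $s$, $m$, and $N$ are defined in Table A.10.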
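As a small illustration of the definition above, the following Python sketch sums λ_i/(1 + λ_i) over a set of eigenvalues and compares the result with a tabled value. The function name and the eigenvalues are hypothetical; the example assumes p = 3, ν_H = 2, ν_E = 24, which gives s = 2, m = 0, N = 10 under the definitions in Table A.10.

```python
def pillai_v(eigenvalues):
    """Pillai's statistic V^(s): sum of lambda_i / (1 + lambda_i)
    over the s nonzero eigenvalues of E^{-1} H."""
    return sum(lam / (1 + lam) for lam in eigenvalues)


# Hypothetical eigenvalues of E^{-1} H (made up for illustration).
v = pillai_v([1.5, 0.2])
critical = 0.451  # Table A.11 entry for s = 2, m = 0, N = 10

print(round(v, 3), v > critical)   # 0.767 True
```

Here V = .767 exceeds .451, so with these made-up eigenvalues H0 would be rejected at α = .05.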


                                                        N
m     0       1       2       3       4       5       6       7       8       9       10      15      20      25

                                                      s = 2
0     1.536   1.232   1.031   .890    .782    .698    .629    .573    .526    .485    .451    .333    .263    .218
1     1.706   1.452   1.258   1.109   .991    .896    .817    .751    .694    .646    .604    .455    .364    .304
2     1.784   1.573   1.397   1.254   1.137   1.039   .956    .886    .825    .772    .725    .556    .451    .379
3     1.829   1.649   1.492   1.358   1.245   1.149   1.065   .993    .930    .875    .825    .643    .526    .445
4     1.859   1.703   1.560   1.436   1.329   1.235   1.153   1.081   1.018   .961    .910    .719    .594    .506
5     1.880   1.742   1.613   1.497   1.395   1.305   1.226   1.155   1.091   1.034   .983    .786    .655    .561
6     1.895   1.772   1.654   1.546   1.450   1.364   1.286   1.217   1.154   1.098   1.046   .846    .710    .612
7     1.907   1.796   1.687   1.586   1.495   1.413   1.338   1.270   1.209   1.153   1.102   .901    .761    .658
8     1.917   1.815   1.714   1.620   1.534   1.455   1.383   1.317   1.257   1.202   1.151   .950    .808    .702
9     1.924   1.831   1.737   1.649   1.567   1.491   1.422   1.358   1.299   1.245   1.195   .995    .851    .743
10    1.931   1.844   1.757   1.673   1.595   1.523   1.456   1.394   1.337   1.284   1.235   1.036   .891    .781
15    1.951   1.888   1.822   1.758   1.695   1.636   1.580   1.527   1.477   1.430   1.386
20    1.963   1.913   1.860   1.807   1.756   1.706   1.658   1.612   1.568   1.527   1.487
25    1.969   1.929   1.885   1.840   1.796   1.753   1.711   1.671   1.632   1.595   1.559



                                                      s = 3
0     2.037   1.710   1.473   1.294   1.153   1.040   .947    .869    .803    .746    .697    .524    .420    .350
1     2.297   1.988   1.751   1.564   1.412   1.287   1.183   1.094   1.017   .950    .892    .682    .552    .453
2     2.447   2.168   1.943   1.759   1.606   1.477   1.367   1.273   1.190   1.117   1.053   .818    .668    .565
3     2.544   2.294   2.084   1.907   1.757   1.628   1.517   1.420   1.334   1.258   1.190   .937    .772    .656
4     2.612   2.386   2.191   2.023   1.878   1.752   1.641   1.543   1.456   1.378   1.308   1.042   .866    .740
5     2.662   2.457   2.276   2.117   1.978   1.854   1.745   1.648   1.561   1.482   1.411   1.137   .952    .818
6     2.701   2.514   2.345   2.194   2.061   1.941   1.835   1.739   1.652   1.573   1.502   1.222   1.030   .890
7     2.732   2.559   2.402   2.259   2.131   2.016   1.912   1.818   1.732   1.654   1.582   1.300   1.103   .957
8     2.757   2.597   2.449   2.314   2.192   2.081   1.979   1.887   1.803   1.726   1.655   1.371   1.170   1.020
9     2.777   2.629   2.490   2.362   2.244   2.137   2.039   1.949   1.866   1.790   1.720   1.436   1.23
10    2.795   2.656   2.525   2.403   2.291   2.187   2.092   2.004   1.923   1.848   1.779   1.496   1.3
15    2.853   2.748   2.646   2.549   2.457   2.370   2.288   2.211   2.139   2.071   2.007
20    2.885   2.802   2.718   2.637   2.560   2.485   2.414   2.347   2.283   2.222   2.163
25    2.906   2.836   2.766   2.697   2.630   2.565   2.503   2.443   2.385

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Table  A.  11.  (Continued) 


                                                    N
m     0       1       2       3       4       5       6       7       8       9       10      15      20

                                                  s = 4
0     2.549   2.194   1.926   1.717   1.548   1.410   1.294   1.196   1.112   1.038   .974    .744    .602
1     2.852   2.510   2.241   2.023   1.844   1.693   1.566   1.456   1.360   1.277   1.203   .932    .761
2     3.052   2.733   2.472   2.256   2.074   1.919   1.786   1.670   1.567   1.477   1.396   1.097   .903
3     3.193   2.898   2.650   2.440   2.260   2.104   1.969   1.849   1.743   1.649   1.564   1.243   1.032
4     3.298   3.025   2.791   2.589   2.413   2.259   2.123   2.002   1.895   1.798   1.710   1.375   1.149
5     3.378   3.126   2.905   2.711   2.541   2.390   2.255   2.135   2.027   1.929   1.840   1.494
6     3.442   3.208   2.999   2.814   2.649   2.502   2.370   2.251   2.143   2.044   1.955   1.602
7     3.494   3.276   3.079   2.902   2.743   2.600   2.470   2.353   2.246   2.148   2.058   1.70
8     3.537   3.333   3.146   2.977   2.824   2.685   2.559   2.444   2.338   2.241   2.151   1.8
9     3.574   3.382   3.205   3.043   2.896   2.761   2.638   2.525   2.421   2.325   2.236
10    3.605   3.424   3.256   3.101   2.959   2.829   2.708   2.598   2.496   2.401   2.313
15    3.710   3.570   3.436   3.310   3.191   3.079   2.974   2.876   2.783   2.696   2.615
20    3.771   3.657   3.546   3.440   3.338   3.241   3.149

Table A.11. (Continued)


                                            N
m     0       1       2       3       4       5       6       7       8       9       10

                                          s = 5
0     3.055   2.681   2.389   2.155   1.962   1.801   1.664   1.547   1.445   1.356   1.277
1     3.390   3.025   2.731   2.488   2.285   2.122   1.964   1.835   1.722   1.622   1.533
2     3.628   3.281   2.993   2.751   2.545   2.367   2.213   2.077   1.957   1.850   1.754
3     3.805   3.478   3.201   2.964   2.759   2.580   2.423   2.284   2.160   2.048   1.948
4     3.941   3.635   3.370   3.140   2.938   2.761   2.604   2.463   2.337   2.222   2.119
5     4.050   3.762   3.510   3.288   3.091   2.916   2.760   2.619   2.492   2.377   2.271
6     4.138   3.868   3.627   3.414   3.223   3.052   2.897   2.758   2.630   2.514   2.408
7     4.212   3.957   3.728   3.522   3.337   3.170   3.018   2.880
8     4.274   4.033   3.815   3.617   3.438   3.275   3.126
9     4.327   4.099   3.890   3.700   3.527   3.369
10    4.372   4.156   3.957   3.774   3.607   3.45

                                          s = 6
0     3.559   3.171   2.859   2.604   2.390   2.209   2.053   1.918   1.799   1.694   1.601
1     3.917   3.535   3.221   2.958   2.734   2.542   2.375   2.229   2.099   1.984   1.881
2     4.185   3.817   3.508   3.245   3.018   2.821   2.647   2.494   2.358   2.235   2.125
3     4.391   4.041   3.741   3.482   3.256   3.057   2.881   2.724   2.583   2.456   2.341
4     4.556   4.223   3.934   3.681   3.458   3.260   3.084   2.925   2.782   2.652   2.534
5     4.690   4.375   4.097   3.851   3.633   3.438   3.262   3.103   2.959   2.827   2.706
6     4.802   4.502   4.236   3.998   3.785
7     4.896   4.611   4.356   4.126   3.919
8     4.976   4.706   4.461   4.239
9     5.045   4.788   4.553
10    5.106   4.860   4.635
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Table A.12. Upper Critical Values for the Lawley-Hotelling Test Statistic, α = .05


The test statistic is $\nu_E U^{(s)}/\nu_H$, where $U^{(s)}$ is the Lawley-Hotelling statistic. Reject $H_0$ if $\nu_E U^{(s)}/\nu_H >$ table value.
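A minimal Python sketch of the scaled statistic follows. It assumes the usual definition of the Lawley-Hotelling statistic as the sum of the eigenvalues of E^{-1}H (the trace), which this page does not restate; the function names and the example values (p = 2, ν_H = 3, ν_E = 20, eigenvalues 0.5 and 0.1) are hypothetical.

```python
def lawley_hotelling_u(eigenvalues):
    """Assumed definition: U^(s) is the sum of the eigenvalues of E^{-1} H."""
    return sum(eigenvalues)


def table_a12_statistic(eigenvalues, nu_e, nu_h):
    """Scaled statistic nu_E * U^(s) / nu_H that is compared with Table A.12."""
    return nu_e * lawley_hotelling_u(eigenvalues) / nu_h


# Hypothetical example (made-up eigenvalues, for illustration only).
stat = table_a12_statistic([0.5, 0.1], nu_e=20, nu_h=3)
critical = 5.4046  # Table A.12 entry for p = 2, nu_H = 3, nu_E = 20

print(round(stat, 4), stat > critical)   # 4.0 False
```

With these made-up numbers the scaled statistic 4.0 falls below 5.4046, so H0 would not be rejected at α = .05.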


                                                            ν_H
ν_E   2        3        4        5        6        8        10       12       15       20       25       40       60

                                                           p = 2
2^a   9.8591   10.659   11.098   11.373   11.562   11.952   11.804   12.052   12.153   12.254   12.316   12.409   12.461
3     58.428   58.915   59.161   59.308   59.407   59.531   59.606   59.655   59.705   59.755   59.785   59.830   59.855
4     23.999   23.312   22.918   22.663   22.484   22.250   22.104   22.003   21.901   21.797   21.733   21.636   21.582
5     15.639   14.864   14.422   14.135   13.934   13.670   13.504   13.391   13.275   13.156   13.083   12.972   12.909
6     12.175   11.411   10.975   10.691   10.491   10.228   10.063   9.9489   9.8320   9.7118   9.6381   9.5251   9.4610
7     10.334   9.5937   9.1694   8.8927   8.6975   8.4396   8.2765   8.16399  8.0480   7.9285   7.8549   7.7417   7.6773
8     9.2069   8.4881   8.0752   7.8054   7.6145   7.3614   7.2008   7.0896   6.9748   6.8560   6.7826   6.6694   6.6048
10    7.9095   7.2243   6.8294   6.5702   6.3860   6.1405   5.9837   5.8745   5.7612   5.6433   5.5701   5.4564   5.3910
12    7.1902   6.5284   6.1461   5.8942   5.7147   5.4744   5.3200   5.2122   5.0997   4.9820   4.9085   4.7938   4.7274
14    6.7350   6.0902   5.7168   5.4703   5.2941   5.0574   4.9048   4.7977   4.6856   4.5678   4.4939   4.3780   4.3105
16    6.4217   5.7895   5.4230   5.1804   5.0067   4.7727   4.6213   4.5147   4.4028   4.2846   4.2102   4.0930   4.0243
18    6.1932   5.5708   5.2095   4.9700   4.7982   4.5663   4.4157   4.3094   4.1976   4.0791   4.0042   3.8855   3.8158
20    6.0192   5.4046   5.0475   4.8105   4.6402   4.4099   4.2600   4.1539   4.0420   3.9231   3.8477   3.7278   3.6569
25    5.7244   5.1237   4.7741   4.5415   4.3740   4.1465   3.9977   3.8919   3.7798   3.6598   3.5832   3.4605   3.3868
30    5.5401   4.9487   4.6040   4.3743   4.2086   3.9829   3.8347   3.7291   3.6166   3.4957   3.4181   3.2926   3.2168
35    5.4140   4.8291   4.4880   4.2604   4.0959   3.8715   3.7237   3.6181   3.5054   3.3836   3.3051   3.1774   3.1000
40    5.3224   4.7424   4.4039   4.1778   4.0143   3.7908   3.6433   3.5377   3.4247   3.3022   3.2230   3.0933   3.0140
50    5.1981   4.6249   4.2900   4.0661   3.9039   3.6817   3.5346   3.4289   3.3154   3.1919   3.1115   2.9787   2.8965
60    5.1178   4.5490   4.2166   3.9941   3.8328   3.6114   3.4646   3.3588   3.2450   3.1206   3.0392   2.9041   2.8196
70    5.0616   4.4960   4.1653   3.9439   3.7831   3.5624   3.4157   3.3099   3.1957   3.0706   2.9886   2.8516   2.7652
80    5.0200   4.4569   4.1275   3.9068   3.7465   3.5262   3.3796   3.2737   3.1594   3.0338   2.9512   2.8126   2.7247
100   4.9628   4.4030   4.0754   3.8557   3.6961   3.4764   3.3300   3.2240   3.1093   2.9829   2.8994   2.7586   2.6683
200   4.8514   4.2982   3.9742   3.7567   3.5983   3.3798   3.2336   3.1275   3.0120   2.8838   2.7984   2.6520   2.5559
∞     4.7442   4.1973   3.8769   3.6614   3.5044   3.2870   3.1410   3.0346   2.9182   2.7879   2.7002   2.5470   2.4428



                                                           p = 3
3^a   25.930   26.996   27.665   28.125   28.712   29.073   29.316   29.561   29.809   29.959   30.19    30.31
4^a   1.1880   1.1929   1.1959   1.1978   1.2003   1.2018   1.2028   1.2038   1.2048   1.2054   1.2063   1.2068
5     42.474   41.764   41.305   40.983   40.562   40.300   40.120   39.937   39.750   39.635   39.462   39.366
6     25.456   24.715   24.235   23.899   23.458   23.182   22.992   22.799   22.600   22.479   22.294   22.190
7     18.752   18.056   17.605   17.288   16.870   16.608   16.427   16.241   16.051   15.934   15.755   15.653
8     15.308   14.657   14.233   13.934   13.540   13.290   13.118   12.941   12.758   12.646   12.473   12.375
10    11.893   11.306   10.921   10.649   10.287   10.057   9.8974   9.7320   9.5603   9.4541   9.2897   9.1955
12    10.229   9.6825   9.3234   9.0680   8.7271   8.5088   8.3566   8.1982   8.0330   7.9301   7.7700   7.6777
14    9.2550   8.7356   8.3935   8.1495   7.8225   7.6122   7.4649   7.3110   7.1497   7.0488   6.8908   6.7991
16    8.6180   8.1183   7.7884   7.5526   7.2355   7.0307   6.8868   6.7360   6.5772   6.4774   6.3204   6.2287
18    8.1701   7.6851   7.3644   7.1347   6.8251   6.6244   6.4830   6.3343   6.1771   6.0780   5.9212   5.8292
20    7.8384   7.3649   7.0513   6.8263   6.5224   6.3249   6.1853   6.0383   5.8822   5.7834   5.6266   5.5341
25    7.2943   6.8407   6.5394   6.3227   6.0287   5.8365   5.7001   5.5555   5.4010   5.3025   5.1446   5.0503
30    6.9654   6.5245   6.2311   6.0196   5.7319   5.5431   5.4085   5.2654   5.1116   5.0129   4.8535   4.7575
35    6.7453   6.3132   6.0253   5.8175   5.5341   5.3476   5.2143   5.0720   4.9185   4.8195   4.6586   4.5608
40    6.5877   6.1621   5.8783   5.6732   5.3929   5.2081   5.0757   4.9340   4.7806   4.6813   4.5189   4.4195
50    6.3773   5.9606   5.6823   5.4809   5.2050   5.0224   4.8911   4.7502   4.5967   4.4968   4.3319   4.2297
60    6.2433   5.8324   5.5577   5.3587   5.0856   4.9044   4.7739   4.6334   4.4798   4.3793   4.2123   4.1078
70    6.1504   5.7436   5.4715   5.2742   5.0031   4.8229   4.6929   4.5526   4.3988   4.2979   4.1292   4.0227
80    6.0823   5.6786   5.4084   5.2122   4.9426   4.7632   4.6336   4.4935   4.3395   4.2381   4.0680   3.9600
100   5.9891   5.5896   5.3220   5.1276   4.8601   4.6817   4.5525   4.4126   4.2583   4.1563   3.9840   3.8734
200   5.8099   5.4186   5.1562   4.9653   4.7017   4.5252   4.3970   4.2574   4.1023   3.9988   3.8212   3.7042
∞     5.6397   5.2565   4.9992   4.8116   4.5519   4.3773   4.2499   4.1104   3.9541   3.8487   3.6642   3.5384

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Table  A.12.  ( Continued ) 


Vh 


ve 

4 

5 

6 

8 

10 

12 

15 

20 

25 

40 

60 

4“ 

49.964 

51.204 

52.054 

53.142 

53.808 

P  =  4 

54.258 

54.71 

55.17 

55.46 

5" 

1.9964 

2.0013 

2.0046 

2.0087 

2.0112 

2.0128 

2.0145 

2.0171 

2.0171 

2.019 

— 

6 

65.715 

64.999 

64.497 

63.841 

63.432 

63.151 

62.866 

62.573 

62.396 

62.13 

— 

7 

37.343 

36.629 

36.129 

35.474 

35.064 

34.782 

34.495 

34.200 

34.019 

33.75 

— 

8 

26.516 

25.868 

25.413 

24.814 

24.437 

24.178 

23.912 

23.639 

23.471 

23.214 

23.072 

10 

17.875 

17.326 

16.938 

16.424 

16.098 

15.872 

15.640 

15.399 

15.250 

15.021 

14.891 

12 

14.338 

13.848 

13.500 

13.037 

12.741 

12.535 

12.321 

12.099 

11.961 

11.747 

11.624 

14 

12.455 

12.002 

11.680 

11.248 

10.972 

10.778 

10.577 

10.366 

10.234 

10.029 

9.9103 

16 

11.295 

10.868 

10.563 

10.154 

9.8904 

9.7054 

9.5119 

9.3085 

9.1810 

8.9808 

8.8644 

18 

10.512 

10.104 

9.8121 

9.4190 

9.1647 

8.9857 

8.7978 

8.5996 

8.4748 

8.2778 

8.1626 

20 

9.9500 

9.5560 

9.2736 

8.8926 

8.6453 

8.4708 

8.2871 

8.0926 

7.9696 

7.7748 

7.6601 

25 

9.0585 

8.6884 

8.4223 

8.0616 

7.8261 

7.6590 

7.4821 

7.2933 

7.1730 

6.9805 

6.8659 

30 

8.5377 

8.1825 

7.9265 

7.5784 

7.3502 

7.1876 

7.0147 

6.8291 

6.7101 

6.5181 

6.4026 

35 

8.1968 

7.8517 

7.6026 

7.2631 

7.0397 

6.8801 

6.7099 

6.5262 

6.4079 

6.2156 

6.0989 

40 

7.9566 

7.6188 

7.3746 

7.0413 

6.8214 

6.6640 

6.4955 

6.3131 

6.1952 

6.0023 

5.8844 

50 

7.6404 

7.3125 

7.0751 

6.7501 

6.5350 

6.3804 

6.2143 

6.0334 

5.9157 

5.7214 

5.6011 

60 

7.4417 

7.1202 

6.8872 

6.5676 

6.3555 

6.2027 

6.0381 

5.8581 

5.7403 

5.5446 

5.4222 

70 

7.3054 

6.9884 

6.7584 

6.4426 

6.2325 

6.0809 

5.9173 

5.7378 

5.6200 

5.4230 

5.2987 

80 

7.2061 

6.8924 

6.6646 

6.3515 

6.1430 

5.9924 

5.8294 

5.6503 

5.5323 

5.3343 

5.2084 

100 

7.0711 

6.7619 

6.5372 

6.2279 

6.0215 

5.8721 

5.7101 

5.5313 

5.4131 

5.2133 

5.0849 

200 

6.8143 

6.5139 

6.2952 

5.9933 

5.7910 

5.6439 

5.4836 

5.3053 

5.1863 

4.9819 

4.8471 

∞

6.5741 

6.2821 

6.0692 

5.7743 

5.5758 

5.4309 

5.2721 

5.0940 

4.9737 

4.7629 

4.6190 



P  =  5 


5^a

81.991 

83.352 

85.093 

86.160 

86.88 

— 

— 

— 

— 

— 

6^a

3.0093 

3.0142 

3.0204 

3.0241 

3.0266 

3.0291 

3.032 

— 

— 

— 

7 

93.762 

93.042 

92.102 

91.515 

91.113 

90.705 

90.29 

90.04 

— 

— 

8 

51.339 

50.646 

49.739 

49.170 

48.780 

48.382 

47.973 

47.723 

47.35 

— 

10 

27.667 

27.115 

26.387 

25.927 

25.610 

25.284 

24.947 

24.740 

24.422 

— 

12 

20.169 

19.701 

19.079 

18.683 

18.409 

18.124 

17.830 

17.647 

17.365 

17.20 

14 

16.643 

16.224 

15.666 

15.309 

15.059 

14.800 

14.530 

14.361 

14.100 

13.95 

16 

14.624 

14.239 

13.722 

13.389 

13.157 

12.914 

12.659 

12.499 

12.250 

12.105 

18 

13.326 

12.963 

12.476 

12.161 

11.939 

11.708 

11.463 

11.310 

11.068 

10.928 

20 

12.424 

12.078 

11.612 

11.310 

11.097 

10.874 

10.637 

10.488 

10.252 

10.113 

25 

11.046 

10.728 

10.297 

10.016 

9.8168 

9.6061 

9.3814 

9.2386 

9.0102 

8.8745 

30 

10.270 

9.9689 

9.5592 

9.2907 

9.0995 

8.8964 

8.6785 

8.5389 

8.3141 

8.1790 

35 

9.7739 

9.4836 

9.0879 

8.8277 

8.6419 

8.4437 

8.2301 

8.0926 

7.8693 

7.7339 

40 

9.4292 

9.1469 

8.7613 

8.5070 

8.3250 

8.1303 

7.9195 

7.7833 

7.5607 

7.4247 

50 

8.9825 

8.7107 

8.3385 

8.0921 

7.9150 

7.7248 

7.5177 

7.3829 

7.1605 

7.0229 

60 

8.7057 

8.4406 

8.0769 

7.8355 

7.6615 

7.4741 

7.2692 

7.1351 

6.9124 

6.7730 

70 

8.5174 

8.2570 

7.8991 

7.6612 

7.4894 

7.3039 

7.1004 

6.9667 

6.7434 

6.6024 

80 

8.3811 

8.1241 

7.7705 

7.5351 

7.3648 

7.1807 

6.9782 

6.8448 

6.6208 

6.4785 

100 

8.1969 

7.9446 

7.5969 

7.3649 

7.1968 

7.0145 

6.8133 

6.6801 

6.4550 

6.3103 

200 

7.8505 

7.6070 

7.2706 

7.0451 

6.8811 

6.7023 

6.5032 

6.3702 

6.1416 

5.9908 

∞

7.5305 

7.2955 

6.9698 

6.7505 

6.5902 

6.4144 

6.2171 

6.0838 

5.8499 

5.6899 

(continued)

Table A.12. (Continued)

p = 6

                                      ν_H
ν_E       6       8      10      12      15      20      25      30      35
 10   45.722  44.677  44.019  43.567  43.103  42.626  42.334  42.136  41.993
 12   28.959  28.121  27.590  27.223  26.843  26.451  26.209  26.044  25.925
 14   22.321  21.600  21.141  20.821  20.489  20.144  19.929  19.783  19.677
 16   18.858  18.210  17.795  17.505  17.202  16.886  16.688  16.553  16.455
 18   16.755  16.157  15.772  15.501  15.218  14.921  14.735  14.607  14.513
 20   15.351  14.788  14.424  14.168  13.899  13.615  13.436  13.313  13.223
 25   13.293  12.786  12.456  12.222  11.975  11.711  11.544  11.428  11.343
 30   12.180  11.705  11.395  11.173  10.939  10.687  10.526  10.414  10.331
 35   11.484  11.031  10.733  10.520  10.293  10.049  9.8921  9.7820  9.7003
 40   11.009  10.571  10.282  10.075  9.8535  9.6142  9.4596  9.3508  9.2699
 50   10.402  9.9832  9.7060  9.5067  9.2927  9.0598  8.9082  8.8009  8.7207
 60   10.031  9.6246  9.3547  9.1602  8.9507  8.7215  8.5717  8.4651  8.3851
 70   9.7813  9.3830  9.1182  8.9269  8.7204  8.4938  8.3450  8.2388  8.1589
 80   9.6014  9.2093  8.9480  8.7591  8.5548  8.3300  8.1819  8.0759  7.9959
100   9.3598  8.9760  8.7197  8.5340  8.3326  8.1102  7.9629  7.8572  7.7771
200   8.9099  8.5419  8.2950  8.1153  7.9193  7.7011  7.5552  7.4494  7.3685
  ∞   8.4997  8.1463  7.9082  7.7340  7.5430  7.3284  7.1832  7.0768  6.9945

^a Multiply each entry in this row by 100.
Table A.13. Orthogonal Polynomial Contrasts

                                   Variable
 p  Polynomial     1     2     3     4     5     6     7     8     9    10    c_i'c_i
 3  Linear        -1     0     1                                                    2
    Quadratic      1    -2     1                                                    6
 4  Linear        -3    -1     1     3                                             20
    Quadratic      1    -1    -1     1                                              4
    Cubic         -1     3    -3     1                                             20
 5  Linear        -2    -1     0     1     2                                       10
    Quadratic      2    -1    -2    -1     2                                       14
    Cubic         -1     2     0    -2     1                                       10
    Quartic        1    -4     6    -4     1                                       70
 6  Linear        -5    -3    -1     1     3     5                                 70
    Quadratic      5    -1    -4    -4    -1     5                                 84
    Cubic         -5     7     4    -4    -7     5                                180
    Quartic        1    -3     2     2    -3     1                                 28
    Quintic       -1     5   -10    10    -5     1                                252
 7  Linear        -3    -2    -1     0     1     2     3                           28
    Quadratic      5     0    -3    -4    -3     0     5                           84
    Cubic         -1     1     1     0    -1    -1     1                            6
    Quartic        3    -7     1     6     1    -7     3                          154
    Quintic       -1     4    -5     0     5    -4     1                           84
    Sextic         1    -6    15   -20    15    -6     1                          924
 8  Linear        -7    -5    -3    -1     1     3     5     7                    168
    Quadratic      7     1    -3    -5    -5    -3     1     7                    168
    Cubic         -7     5     7     3    -3    -7    -5     7                    264
    Quartic        7   -13    -3     9     9    -3   -13     7                    616
    Quintic       -7    23   -17   -15    15    17   -23     7                  2,184
    Sextic         1    -5     9    -5    -5     9    -5     1                    264
    Septic        -1     7   -21    35   -35    21    -7     1                  3,432
 9  Linear        -4    -3    -2    -1     0     1     2     3     4               60
    Quadratic     28     7    -8   -17   -20   -17    -8     7    28            2,772
    Cubic        -14     7    13     9     0    -9   -13    -7    14              990
    Quartic       14   -21   -11     9    18     9   -11   -21    14            2,002
    Quintic       -4    11    -4    -9     0     9     4   -11     4              468
    Sextic         4   -17    22     1   -20     1    22   -17     4            1,980
    Septic        -1     6   -14    14     0   -14    14    -6     1              858
    Octic          1    -8    28   -56    70   -56    28    -8     1           12,870
10  Linear        -9    -7    -5    -3    -1     1     3     5     7     9        330
    Quadratic      6     2    -1    -3    -4    -4    -3    -1     2     6        132
    Cubic        -42    14    35    31    12   -12   -31   -35   -14    42      8,580
    Quartic       18   -22   -17     3    18    18     3   -17   -22    18      2,860
    Quintic       -6    14    -1   -11    -6     6    11     1   -14     6        780
    Sextic         3   -11    10     6    -8    -8     6    10   -11     3        660
    Septic        -9    47   -86    42    56   -56   -42    86   -47     9     29,172
    Octic          1    -7    20   -28    14    14   -28    20    -7     1      2,860
    Novic         -1     9   -36    84  -126   126   -84    36    -9     1     48,620

Note: Entries are rows c_i' of the (p - 1) x p matrix C illustrated in (6.91) in Section 6.10.1.
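
The defining properties of the rows in Table A.13 can be checked mechanically. The following is a minimal sketch, using only the p = 4 rows copied from the table; the helper name `dot` is introduced here for illustration and is not from the text. It confirms that each row sums to zero, that distinct rows are orthogonal, and that the sum of squares of each row matches the tabled value of c_i'c_i.

```python
from itertools import combinations

# Rows of C for p = 4, copied from Table A.13 (linear, quadratic, cubic).
contrasts = {
    "linear": [-3, -1, 1, 3],
    "quadratic": [1, -1, -1, 1],
    "cubic": [-1, 3, -3, 1],
}


def dot(u, v):
    """Inner product of two equal-length coefficient vectors."""
    return sum(a * b for a, b in zip(u, v))


# Each row sums to zero, so it is a contrast.
row_sums = {name: sum(c) for name, c in contrasts.items()}

# c'c for each row; these should match the last column of the table.
norms = {name: dot(c, c) for name, c in contrasts.items()}

# Every pair of distinct rows should have inner product zero.
cross = {(a, b): dot(contrasts[a], contrasts[b])
         for a, b in combinations(contrasts, 2)}

print(row_sums)
print(norms)
print(cross)
```

The same check applied to any other value of p in the table is a quick way to catch transcription errors in the coefficients.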
Table A.14. Test for Equal Covariance Matrices, α = .05

  ν    k = 2   k = 3   k = 4   k = 5   k = 6   k = 7   k = 8   k = 9  k = 10

                                    p = 2
  3    12.18   18.70   24.55   30.09   35.45   40.68   45.81   50.87   55.86
  4    10.70   16.65   22.00   27.07   31.97   36.75   41.45   46.07   50.64
  5     9.97   15.63   20.73   25.57   30.23   34.79   39.26   43.67   48.02
  6     9.53   15.02   19.97   24.66   29.19   33.61   37.95   42.22   46.45
  7     9.24   14.62   19.46   24.05   28.49   32.83   37.08   41.26   45.40
  8     9.04   14.33   19.10   23.62   27.99   32.26   36.44   40.57   44.64
  9     8.88   14.11   18.83   23.30   27.62   31.84   35.98   40.05   44.08
 10     8.76   13.94   18.61   23.05   27.33   31.51   35.61   39.65   43.64
 11     8.67   13.81   18.44   22.85   27.10   31.25   35.32   39.33   43.29
 12     8.59   13.70   18.30   22.68   26.90   31.03   35.08   39.07   43.00
 13     8.52   13.60   18.19   22.54   26.75   30.85   34.87   38.84   42.76
 14     8.47   13.53   18.10   22.42   26.61   30.70   34.71   38.66   42.56
 15     8.42   13.46   18.01   22.33   26.50   30.57   34.57   38.50   42.38
 16     8.38   13.40   17.94   22.24   26.40   30.45   34.43   38.36   42.23
 17     8.35   13.35   17.87   22.17   26.31   30.35   34.32   38.24   42.10
 18     8.32   13.30   17.82   22.10   26.23   30.27   34.23   38.13   41.99
 19     8.28   13.26   17.77   22.04   26.16   30.19   34.14   38.04   41.88
 20     8.26   13.23   17.72   21.98   26.10   30.12   34.07   37.95   41.79
 25     8.17   13.10   17.55   21.79   25.87   29.86   33.78   37.63   41.44
 30     8.11   13.01   17.44   21.65   25.72   29.69   33.59   37.42   41.21

                                    p = 3
  4    22.41   35.00   46.58   57.68   68.50   79.11   89.60   99.94  110.21
  5    19.19   30.52   40.95   50.95   60.69   70.26   79.69   89.03   98.27
  6    17.57   28.24   38.06   47.49   56.67   65.69   74.58   83.39   92.09
  7    16.59   26.84   36.29   45.37   54.20   62.89   71.44   79.90   88.30
  8    15.93   25.90   35.10   43.93   52.54   60.99   69.32   77.57   85.73
  9    15.46   25.22   34.24   42.90   51.33   59.62   67.78   75.86   83.87
 10    15.11   24.71   33.59   42.11   50.42   58.57   66.62   74.58   82.46
 11    14.83   24.31   33.08   41.50   49.71   57.76   65.71   73.57   81.36
 12    14.61   23.99   32.67   41.00   49.13   57.11   64.97   72.75   80.45
 13    14.43   23.73   32.33   40.60   48.65   56.56   64.36   72.09   79.72
 14    14.28   23.50   32.05   40.26   48.26   56.11   63.86   71.53   79.11
 15    14.15   23.32   31.81   39.97   47.92   55.73   63.43   71.05   78.60
 16    14.04   23.16   31.60   39.72   47.63   55.40   63.06   70.64   78.14
 17    13.94   23.02   31.43   39.50   47.38   55.11   62.73   70.27   77.76
 18    13.86   22.89   31.26   39.31   47.16   54.86   62.45   69.97   77.41
 19    13.79   22.78   31.13   39.15   46.96   54.64   62.21   69.69   77.11
 20    13.72   22.69   31.01   39.00   46.79   54.44   61.98   69.45   76.84
 25    13.48   22.33   30.55   38.44   46.15   53.70   61.16   68.54   75.84
 30    13.32   22.10   30.25   38.09   45.73   53.22   60.62   67.94   75.18

                                    p = 4
  5    35.39   56.10   75.36   93.97  112.17  130.11  147.81  165.39  182.80
  6    30.06   48.62   65.90   82.60   98.93  115.03  130.94  146.69  162.34
  7    27.31   44.69   60.89   76.56   91.88  106.98  121.90  136.71  151.39
  8    25.61   42.24   57.77   72.77   87.46  101.94  116.23  130.43  144.50
  9    24.45   40.57   55.62   70.17   84.42   98.46  112.32  126.08  139.74
 10    23.62   39.34   54.04   68.26   82.19   95.90  109.46  122.91  136.24
 11    22.98   38.41   52.84   66.81   80.48   93.95  107.27  120.46  133.57
 12    22.48   37.67   51.90   65.66   79.14   92.41  105.54  118.55  131.45
 13    22.08   37.08   51.13   64.73   78.04   91.15  104.12  116.98  129.74
 14    21.75   36.59   50.50   63.95   77.13   90.12  102.97  115.69  128.32
 15    21.47   36.17   49.97   63.30   76.37   89.26  101.99  114.59  127.14
 16    21.24   35.82   49.51   62.76   75.73   88.51  101.14  113.67  126.10
 17    21.03   35.52   49.12   62.28   75.16   87.87  100.42  112.87  125.22
 18    20.86   35.26   48.78   61.86   74.68   87.31   99.80  112.17  124.46
 19    20.70   35.02   48.47   61.50   74.25   86.82   99.25  111.56  123.79
 20    20.56   34.82   48.21   61.17   73.87   86.38   98.75  111.02  123.18
 25    20.06   34.06   47.23   59.98   72.47   84.78   96.95  109.01  120.99
 30    19.74   33.59   46.61   59.21   71.58   83.74   95.79  107.71  119.57

                                    p = 5
  6    51.11   81.99  110.92  138.98  166.54  193.71  220.66  247.37  273.88
  7    43.40   71.06   97.03  122.22  146.95  171.34  195.49  219.47  243.30
  8    39.29   65.15   89.45  113.03  136.18  159.04  181.65  204.14  226.48
  9    36.71   61.39   84.62  107.17  129.30  151.17  172.80  194.27  215.64
 10    34.93   58.78   81.25  103.06  124.48  145.64  166.56  187.37  208.02
 11    33.62   56.85   78.75  100.02  120.92  141.54  161.98  182.24  202.37
 12    32.62   55.37   76.83   97.68  118.15  138.38  158.38  178.23  198.03
 13    31.83   54.19   75.30   95.82  115.96  135.86  155.54  175.10  194.51
 14    31.19   53.23   74.05   94.29  114.16  133.80  153.21  172.49  191.68
 15    30.66   52.44   73.01   93.02  112.66  132.07  151.29  170.36  189.38
 16    30.22   51.76   72.14   91.94  111.41  130.61  149.66  166.53  187.32
 17    29.83   51.19   71.39   91.03  110.34  129.38  148.25  166.99  185.61
 18    29.51   50.69   70.74   90.23  109.39  128.29  147.03  165.65  184.10
 19    29.22   50.26   70.17   89.54  108.57  127.36  145.97  164.45  182.81
 20    28.97   49.88   69.67   88.93  107.85  126.52  145.02  163.38  181.65
 25    28.05   48.48   67.86   86.70  105.21  123.51  141.62  159.60  177.49
 30    27.48   47.61   66.71   85.29  103.56  121.60  139.47  157.22  174.87

Note: Table contains upper percentage points for

$-2\ln M = \nu\left(k\ln|\mathbf{S}| - \sum_{i=1}^{k}\ln|\mathbf{S}_i|\right)$

for k samples, each with ν degrees of freedom. Reject $H_0\colon \boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 = \cdots = \boldsymbol{\Sigma}_k$ if $-2\ln M$ exceeds the table value.
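
As an illustration of how a statistic of this kind is compared with Table A.14, the sketch below assumes the equal-degrees-of-freedom form -2 ln M = ν(k ln|S| - Σ ln|S_i|), with S taken to be the pooled covariance matrix (the simple average of the S_i when every sample has the same ν). The function names and the two 2 x 2 covariance matrices are hypothetical and chosen only to keep the arithmetic transparent.

```python
import math


def det2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def neg2_ln_m(cov_matrices, nu):
    """-2 ln M for k 2x2 sample covariance matrices, each with nu d.f.

    Assumes equal degrees of freedom, so the pooled matrix is the average.
    """
    k = len(cov_matrices)
    pooled = [[sum(s[r][c] for s in cov_matrices) / k for c in range(2)]
              for r in range(2)]
    return nu * (k * math.log(det2(pooled))
                 - sum(math.log(det2(s)) for s in cov_matrices))


# Hypothetical sample covariance matrices for k = 2 groups, p = 2 variables.
s1 = [[1.0, 0.0], [0.0, 1.0]]
s2 = [[4.0, 0.0], [0.0, 4.0]]

stat = neg2_ln_m([s1, s2], nu=10)
critical = 8.76  # Table A.14 entry for p = 2, k = 2, v = 10
print(round(stat, 3), stat > critical)
```

When all the S_i are identical the statistic is zero, and it grows as the matrices diverge; here it slightly exceeds the tabled value, so equality would be rejected at the .05 level.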
Table A.15. Test for Independence of p Variables

Upper percentage points for

$u' = -\left[\nu - \dfrac{2p + 5}{6}\right]\ln|\mathbf{R}|$,

where ν is the degrees of freedom of S or R. Reject the hypothesis of independence if u' is greater than the table value. The $\chi^2_\alpha$ values are shown for comparison, since u' is approximately $\chi^2$ distributed with $f = \tfrac{1}{2}p(p - 1)$ degrees of freedom.

  ν    p = 3   p = 4   p = 5   p = 6   p = 7   p = 8   p = 9  p = 10

                               α = .05
  4    8.020
  5    7.834   15.22
  6    7.814   13.47   24.01
  7    7.811   13.03   20.44   34.30
  8    7.811   12.85   19.45   28.75   46.05
  9    7.811   12.76   19.02   27.11   38.41   59.25
 10    7.812   12.71   18.80   26.37   36.03   49.42   73.79
 11    7.812   12.68   18.67   25.96   34.91   46.22   61.76   89.92
 12    7.813   12.66   18.58   25.71   34.28   44.67   57.68   75.45
 13    7.813   12.65   18.52   25.55   33.89   43.78   55.65   70.43
 14    7.813   12.64   18.48   25.44   33.63   43.21   54.46   67.87
 15    7.813   12.63   18.45   25.36   33.44   42.82   53.69   66.34
 16    7.814   12.62   18.43   25.30   33.31   42.55   53.15   65.33
 17    7.814   12.62   18.41   25.25   33.20   42.34   52.77   64.63
 18    7.814   12.62   18.40   25.21   33.12   42.19   52.48   64.12
 19    7.814   12.61   18.38   25.19   33.06   42.06   52.26   63.73
 20    7.814   12.61   18.37   25.16   33.01   41.97   52.08   63.43
 χ²    7.815   12.59   18.31   25.00   32.67   41.34   51.00   61.66

                               α = .01
  4    11.79
  5    11.41   21.18
  6    11.36   18.27   32.16
  7    11.34   17.54   26.50   44.65
  8    11.34   17.24   24.95   36.09   58.61
  9    11.34   17.10   24.29   33.63   47.05   74.01
 10    11.34   17.01   23.95   32.54   43.59   59.36   90.87
 11    11.34   16.96   23.75   31.95   42.00   54.83   73.03  109.53
 12    11.34   16.93   23.62   31.60   41.13   52.70   67.37   88.05
 13    11.34   16.90   23.53   31.36   40.59   51.49   64.64   81.20
 14    11.34   16.89   23.47   31.20   40.23   50.73   63.06   77.83
 15    11.34   16.87   23.42   31.09   39.97   50.22   62.05   75.84
 16    11.34   16.86   23.39   31.00   39.79   49.85   61.36   74.56
 17    11.34   16.86   23.36   30.94   39.65   49.59   60.86   73.66
 18    11.34   16.85   23.34   30.88   39.54   49.38   60.49   73.01
 19    11.34   16.85   23.32   30.84   39.46   49.22   60.21   72.52
 20    11.34   16.84   23.31   30.81   39.39   49.09   59.99   72.15
 χ²    11.34   16.81   23.21   30.58   38.93   48.28   58.57   69.92
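
To show how Table A.15 is used, here is a small sketch that assumes the statistic has the form u' = -[ν - (2p + 5)/6] ln|R| described in the table heading. The correlation matrix, the degrees of freedom, and the function names `det3` and `u_prime` are hypothetical choices made for illustration.

```python
import math


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def u_prime(det_r, nu, p):
    """Test statistic u' computed from |R|, the degrees of freedom nu, and p."""
    return -(nu - (2 * p + 5) / 6) * math.log(det_r)


# Hypothetical correlation matrix with all pairwise correlations equal to .5.
r = [[1.0, 0.5, 0.5],
     [0.5, 1.0, 0.5],
     [0.5, 0.5, 1.0]]

det_r = det3(r)
u = u_prime(det_r, nu=20, p=3)
critical = 7.814  # Table A.15 entry for alpha = .05, p = 3, v = 20
print(round(det_r, 4), round(u, 3), u > critical)
```

With independent variables R is close to the identity, |R| is close to 1, and u' is close to 0; correlated variables shrink |R| and inflate u'.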

APPENDIX  B 


Answers  and  Hints  to  Problems 


CHAPTER  2 


2.1  (a)  A  +  B  = 


(b)  A' A  = 


2.2  (a)  (A  +  BV  = 


(b)  A'  = 


2.3  (a)  AB  = 


A  —  B  = 


4  -1 
-4  13 


AA'  = 


A'  +  B'  = 


29 

62 


62 

138 


(A')'  = 


4  2 
7  5 


BA  = 


2  6 
11  -2 


(b)  |AB|  =  -70,  |A|  =  -7 

3  3 
3  4 

(b)  tr(A)  =  0,  tr(B)  =  7 


2.4  (a)  A  +  B  = 


B|  =  10 
tr(A  +  B)  =  7 


2.5  (a)  AB  = 


BA  = 


(b)  tr(AB)  =  1,  tr(BA)  =  1 

2.6  (b)  x  =  (  1  1  -1  )' 

2.7  (a)  Bx  =  (13,  6,  9)'  (b)  y'B  =  (25,  -1,  17) 

(d)  x'Ay  =  43  (e)  x'x  =  6  (f)  x'y  =  3 

(g) $\mathbf{x}\mathbf{x}' = \begin{pmatrix} 1 & -1 & 2 \\ -1 & 1 & -2 \\ 2 & -2 & 4 \end{pmatrix}$


=  A 


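(c) $\mathbf{x}'\mathbf{A}\mathbf{x} = 16$

(h) $\mathbf{x}\mathbf{y}' = \begin{pmatrix} 3 & 2 & 1 \\ -3 & -2 & -1 \\ 6 & 4 & 2 \end{pmatrix}$

(i) $\mathbf{B}'\mathbf{B} = \begin{pmatrix} 62 & 7 & 22 \\ 7 & 14 & 7 \\ 22 & 7 & 41 \end{pmatrix}$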

2.8  (a)  x  +  y  =  (4, 1,  3)',  x  -  y  =  (-2,  -3,  1)' 

(b) $(\mathbf{x} - \mathbf{y})'\mathbf{A}(\mathbf{x} - \mathbf{y}) = -31$

2.9 $\mathbf{B}\mathbf{x} = \mathbf{b}_1x_1 + \mathbf{b}_2x_2 + \mathbf{b}_3x_3$

/  3  \  (  -2  \ 

=  (D  7  +(-l)  1  +(2)  0  I  = 


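2.10 (a) $(\mathbf{A}\mathbf{B})' = \mathbf{B}'\mathbf{A}'$

(c) $|\mathbf{A}| = 5$

2.11 (a) $\mathbf{a}'\mathbf{b} = 5$, $(\mathbf{a}'\mathbf{b})^2 = 25$

(b) $\mathbf{b}\mathbf{b}' = \begin{pmatrix} 4 & 2 & 6 \\ 2 & 1 & 3 \\ 6 & 3 & 9 \end{pmatrix}$, $\mathbf{a}'(\mathbf{b}\mathbf{b}')\mathbf{a} = 25$

2.12 $\mathbf{D}\mathbf{A} = \begin{pmatrix} a & 2a & 3a \\ 4b & 5b & 6b \\ 7c & 8c & 9c \end{pmatrix}$, $\mathbf{A}\mathbf{D} = \begin{pmatrix} a & 2b & 3c \\ 4a & 5b & 6c \\ 7a & 8b & 9c \end{pmatrix}$,

$\mathbf{D}\mathbf{A}\mathbf{D} = \begin{pmatrix} a^2 & 2ab & 3ac \\ 4ab & 5b^2 & 6bc \\ 7ac & 8bc & 9c^2 \end{pmatrix}$

2.13 $\mathbf{A}\mathbf{B} = \begin{pmatrix} 8 & 9 & 5 & 6 \\ 7 & 5 & 5 & 4 \\ 3 & 4 & 2 & 2 \end{pmatrix}$

2.14 $\mathbf{A}\mathbf{B} = \begin{pmatrix} 3 & 5 \\ 1 & 4 \end{pmatrix} = \mathbf{C}\mathbf{B}$

2.15 (a) $\mathrm{tr}(\mathbf{A}) = 5$, $\mathrm{tr}(\mathbf{B}) = 5$

(b) $\mathbf{A} + \mathbf{B} = \begin{pmatrix} 6 & 4 & 5 \\ 2 & -2 & 1 \\ 4 & 9 & 6 \end{pmatrix}$, $\mathrm{tr}(\mathbf{A} + \mathbf{B}) = 10$

(c) $|\mathbf{A}| = 0$, $|\mathbf{B}| = 2$

(d) $\mathbf{A}\mathbf{B} = \begin{pmatrix} 9 & 12 & 17 \\ 3 & -1 & 5 \\ 6 & 13 & 12 \end{pmatrix}$, $\det(\mathbf{A}\mathbf{B}) = 0$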
2.16 (a) $|\mathbf{A}| = 36$

(b) $\mathbf{T} = \begin{pmatrix} 1.7321 & 2.3094 & 1.7321 \\ 0.0 & 1.6330 & 1.2247 \\ 0.0 & 0.0 & 2.1213 \end{pmatrix}$

2.17 (a) $\det(\mathbf{A}) = 1$

(b) $\mathbf{T} = \begin{pmatrix} 1.7321 & -2.8868 & -.5774 \\ 0.0 & 2.1602 & -.7715 \\ 0.0 & 0.0 & .2673 \end{pmatrix}$


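2.18 (a) $\mathbf{C} = \begin{pmatrix} .4082 & -.5774 & .7071 \\ .8165 & .5774 & .0000 \\ .4082 & -.5774 & -.7071 \end{pmatrix}$

2.19 (a) Eigenvalues: 2, 1, -1

Eigenvectors: $\begin{pmatrix} .3015 \\ .9045 \\ .3015 \end{pmatrix}$, $\begin{pmatrix} .7999 \\ .5368 \\ .2684 \end{pmatrix}$, $\begin{pmatrix} .7071 \\ 0 \\ .7071 \end{pmatrix}$

(b) $\mathrm{tr}(\mathbf{A}) = 2$, $|\mathbf{A}| = -2$

2.20 (a) $\mathbf{C} = \begin{pmatrix} .0000 & .5774 & -.8165 \\ -.7071 & -.5774 & -.4082 \\ .7071 & -.5774 & -.4082 \end{pmatrix}$

(b) $\mathbf{C}'\mathbf{A}\mathbf{C} = \begin{pmatrix} -2 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 4 \end{pmatrix}$

(c) $\mathbf{C}\mathbf{D}\mathbf{C}' = \begin{pmatrix} 3 & 1 & 1 \\ 1 & 0 & 2 \\ 1 & 2 & 0 \end{pmatrix} = \mathbf{A}$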

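2.21 Eigenvalues: 1, 3, $\mathbf{C} = \begin{pmatrix} -.7071 & -.7071 \\ -.7071 & .7071 \end{pmatrix}$,

$\mathbf{A}^{1/2} = \mathbf{C}\mathbf{D}^{1/2}\mathbf{C}' = \begin{pmatrix} 1.3660 & -.3660 \\ -.3660 & 1.3660 \end{pmatrix}$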
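
The square root in 2.21 can be reproduced numerically. The sketch below assumes the eigenvalues 1 and 3 and the matrix C quoted in the answer; the underlying matrix A is not restated there, so it is recovered here by squaring the result (an inference, not a quotation from the problem). The helpers `matmul` and `transpose` are written out so that nothing beyond the standard library is needed.

```python
import math


def matmul(a, b):
    """Product of two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    """Transpose of a matrix stored as nested lists."""
    return [list(row) for row in zip(*a)]


r = 1 / math.sqrt(2)                      # .7071 to four places
c = [[-r, -r], [-r, r]]                   # eigenvectors as columns of C
d_half = [[math.sqrt(1.0), 0.0],          # D^{1/2} for eigenvalues 1 and 3
          [0.0, math.sqrt(3.0)]]

a_half = matmul(matmul(c, d_half), transpose(c))   # A^{1/2} = C D^{1/2} C'
a = matmul(a_half, a_half)                          # squaring recovers A

print([[round(v, 4) for v in row] for row in a_half])
print([[round(v, 4) for v in row] for row in a])
```

Squaring the computed root returns a symmetric matrix with 2 on the diagonal and -1 off the diagonal, which is consistent with eigenvalues 1 and 3.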


2.22 (a) The spectral decomposition of A is given by $\mathbf{A} = \mathbf{C}\mathbf{D}\mathbf{C}'$, where

$\mathbf{C} = \begin{pmatrix} .455 & -.580 & .675 \\ .846 & .045 & -.531 \\ .278 & .813 & .511 \end{pmatrix}$ and $\mathbf{D} = \mathrm{diag}(13.542, 3.935, -2.477)$.

(b) The spectral decomposition of $\mathbf{A}^2$ is given by $\mathbf{A}^2 = \mathbf{C}\mathbf{D}\mathbf{C}'$, where C is the same as in part (a) and $\mathbf{D} = \mathrm{diag}(183.378, 15.486, 6.135)$. Note that the diagonal elements of D are the squares of the diagonal elements of D in part (a).

(c) The spectral decomposition of $\mathbf{A}^{-1}$ is given by $\mathbf{A}^{-1} = \mathbf{C}\mathbf{D}\mathbf{C}'$, where

$\mathbf{C} = \begin{pmatrix} -.580 & .455 & .675 \\ .045 & .846 & -.531 \\ .813 & .278 & .511 \end{pmatrix}$ and $\mathbf{D} = \mathrm{diag}(.254, .074, -.404)$.

The diagonal elements of D are the reciprocals of those of D in part (a). The first two columns of C have been interchanged to match the interchange of the corresponding elements of D; that is, $\mathbf{D} = \mathrm{diag}(1/\lambda_2, 1/\lambda_1, 1/\lambda_3)$.

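2.23 $\mathbf{A} = \mathbf{U}\mathbf{D}\mathbf{V}'$, where $\mathbf{D} = \mathrm{diag}(13.161, 7.000, 3.433)$,

$\mathbf{U} = \begin{pmatrix} .282 & -.730 & .424 \\ .591 & -.146 & .184 \\ -.225 & .404 & .886 \\ .721 & .531 & -.040 \end{pmatrix}$, $\mathbf{V} = \begin{pmatrix} .856 & -.015 & .517 \\ -.156 & .946 & .284 \\ .494 & .324 & -.807 \end{pmatrix}$

2.24 (a) $\mathbf{j}'\mathbf{a} = (1)a_1 + (1)a_2 + \cdots + (1)a_n = \sum_i a_i = \mathbf{a}'\mathbf{j}$

(b) $\mathbf{j}'\mathbf{A} = [(1)a_{11} + (1)a_{21} + \cdots + (1)a_{n1}, \ldots, (1)a_{1p} + (1)a_{2p} + \cdots + (1)a_{np}] = \left(\sum_i a_{i1}, \sum_i a_{i2}, \ldots, \sum_i a_{ip}\right)$

$\mathbf{A}\mathbf{j} = \begin{pmatrix} (1)a_{11} + (1)a_{12} + \cdots + (1)a_{1p} \\ (1)a_{21} + (1)a_{22} + \cdots + (1)a_{2p} \\ \vdots \\ (1)a_{n1} + (1)a_{n2} + \cdots + (1)a_{np} \end{pmatrix} = \begin{pmatrix} \sum_j a_{1j} \\ \sum_j a_{2j} \\ \vdots \\ \sum_j a_{nj} \end{pmatrix}$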

2.25  (x  -  y)'  (x  -  y)  =  (x'  -  y')  (x  -  y)  =  x'x  -  x'y  -  y'x  +  y'y 


=  x'x  -  2x'y  +  y'y 

2.26  By  (2.27),  (A'A)'  =  A' (A')'.  By  (2.6),  (A')'  =  A.  Thus,  (A'A)'  =  A'A. 

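2.27 (a) $\sum_i \mathbf{a}'\mathbf{x}_i = \mathbf{a}'\mathbf{x}_1 + \mathbf{a}'\mathbf{x}_2 + \cdots + \mathbf{a}'\mathbf{x}_n$

$= \mathbf{a}'(\mathbf{x}_1 + \mathbf{x}_2 + \cdots + \mathbf{x}_n)$ [by (2.21)]

$= \mathbf{a}'\sum_i \mathbf{x}_i$

(b) $\sum_i \mathbf{A}\mathbf{x}_i = \mathbf{A}\mathbf{x}_1 + \mathbf{A}\mathbf{x}_2 + \cdots + \mathbf{A}\mathbf{x}_n$

$= \mathbf{A}(\mathbf{x}_1 + \mathbf{x}_2 + \cdots + \mathbf{x}_n)$ [by (2.21)]

$= \mathbf{A}\sum_i \mathbf{x}_i$

(c) $\sum_i (\mathbf{a}'\mathbf{x}_i)^2 = \sum_i \mathbf{a}'(\mathbf{x}_i\mathbf{x}_i')\mathbf{a}$ [by (2.40)]

$= \mathbf{a}'\left(\sum_i \mathbf{x}_i\mathbf{x}_i'\right)\mathbf{a}$ [by (2.29)]

(d) $\sum_i \mathbf{A}\mathbf{x}_i(\mathbf{A}\mathbf{x}_i)' = \sum_i \mathbf{A}\mathbf{x}_i\mathbf{x}_i'\mathbf{A}' = \mathbf{A}\left(\sum_i \mathbf{x}_i\mathbf{x}_i'\right)\mathbf{A}'$ [by (2.29)]

2.28 (a) $\mathbf{A}\mathbf{x} = \begin{pmatrix} \mathbf{a}_1' \\ \mathbf{a}_2' \end{pmatrix}\mathbf{x} = \begin{pmatrix} \mathbf{a}_1'\mathbf{x} \\ \mathbf{a}_2'\mathbf{x} \end{pmatrix}$

(b) $\mathbf{A}\mathbf{S}\mathbf{A}' = \begin{pmatrix} \mathbf{a}_1' \\ \mathbf{a}_2' \end{pmatrix}\mathbf{S}(\mathbf{a}_1, \mathbf{a}_2) = \begin{pmatrix} \mathbf{a}_1' \\ \mathbf{a}_2' \end{pmatrix}(\mathbf{S}\mathbf{a}_1, \mathbf{S}\mathbf{a}_2)$ [by (2.48)]

$= \begin{pmatrix} \mathbf{a}_1'\mathbf{S}\mathbf{a}_1 & \mathbf{a}_1'\mathbf{S}\mathbf{a}_2 \\ \mathbf{a}_2'\mathbf{S}\mathbf{a}_1 & \mathbf{a}_2'\mathbf{S}\mathbf{a}_2 \end{pmatrix}$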

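2.29 (a) If $\mathbf{A} = \begin{pmatrix} \mathbf{a}_1' \\ \vdots \\ \mathbf{a}_n' \end{pmatrix}$, then by (2.68), $\mathbf{A}' = (\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_n)$ and

$\mathbf{A}'\mathbf{A} = (\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_n)\begin{pmatrix} \mathbf{a}_1' \\ \vdots \\ \mathbf{a}_n' \end{pmatrix} = \mathbf{a}_1\mathbf{a}_1' + \mathbf{a}_2\mathbf{a}_2' + \cdots + \mathbf{a}_n\mathbf{a}_n'$ [by (2.66)].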


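2.30 $\mathbf{A}^{-1}\mathbf{A} = \mathbf{I}$

$(\mathbf{A}^{-1}\mathbf{A})' = \mathbf{I}' = \mathbf{I}$

$\mathbf{A}'(\mathbf{A}^{-1})' = \mathbf{I}$

$(\mathbf{A}')^{-1}\mathbf{A}'(\mathbf{A}^{-1})' = (\mathbf{A}')^{-1}\mathbf{I} = (\mathbf{A}')^{-1}$

$(\mathbf{A}^{-1})' = (\mathbf{A}')^{-1}$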

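2.31 $\dfrac{1}{b}\begin{pmatrix} b\mathbf{A}_{11}^{-1} + \mathbf{A}_{11}^{-1}\mathbf{a}_{12}\mathbf{a}_{12}'\mathbf{A}_{11}^{-1} & -\mathbf{A}_{11}^{-1}\mathbf{a}_{12} \\ -\mathbf{a}_{12}'\mathbf{A}_{11}^{-1} & 1 \end{pmatrix}\begin{pmatrix} \mathbf{A}_{11} & \mathbf{a}_{12} \\ \mathbf{a}_{12}' & a_{22} \end{pmatrix}$

$= \dfrac{1}{b}\begin{pmatrix} b\mathbf{I} + \mathbf{A}_{11}^{-1}\mathbf{a}_{12}\mathbf{a}_{12}' - \mathbf{A}_{11}^{-1}\mathbf{a}_{12}\mathbf{a}_{12}' & b\mathbf{A}_{11}^{-1}\mathbf{a}_{12} + \mathbf{A}_{11}^{-1}\mathbf{a}_{12}\mathbf{a}_{12}'\mathbf{A}_{11}^{-1}\mathbf{a}_{12} - \mathbf{A}_{11}^{-1}\mathbf{a}_{12}a_{22} \\ -\mathbf{a}_{12}' + \mathbf{a}_{12}' & -\mathbf{a}_{12}'\mathbf{A}_{11}^{-1}\mathbf{a}_{12} + a_{22} \end{pmatrix}$

$= \dfrac{1}{b}\begin{pmatrix} b\mathbf{I} & \mathbf{0} \\ \mathbf{0}' & b \end{pmatrix} = \begin{pmatrix} \mathbf{I} & \mathbf{0} \\ \mathbf{0}' & 1 \end{pmatrix}$, where $b = a_{22} - \mathbf{a}_{12}'\mathbf{A}_{11}^{-1}\mathbf{a}_{12}$.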

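2.32 $(\mathbf{B} + \mathbf{c}\mathbf{c}')\left(\mathbf{B}^{-1} - \dfrac{\mathbf{B}^{-1}\mathbf{c}\mathbf{c}'\mathbf{B}^{-1}}{1 + \mathbf{c}'\mathbf{B}^{-1}\mathbf{c}}\right)$

$= \mathbf{I} - \dfrac{\mathbf{c}\mathbf{c}'\mathbf{B}^{-1}}{1 + \mathbf{c}'\mathbf{B}^{-1}\mathbf{c}} + \mathbf{c}\mathbf{c}'\mathbf{B}^{-1} - \dfrac{\mathbf{c}(\mathbf{c}'\mathbf{B}^{-1}\mathbf{c})\mathbf{c}'\mathbf{B}^{-1}}{1 + \mathbf{c}'\mathbf{B}^{-1}\mathbf{c}}$

$= \mathbf{I} - \mathbf{c}\mathbf{c}'\mathbf{B}^{-1}\left(\dfrac{1 + \mathbf{c}'\mathbf{B}^{-1}\mathbf{c}}{1 + \mathbf{c}'\mathbf{B}^{-1}\mathbf{c}}\right) + \mathbf{c}\mathbf{c}'\mathbf{B}^{-1} = \mathbf{I}$ [by (2.26)]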
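
The identity verified in 2.32 can also be checked numerically. The following sketch uses a hypothetical 2 x 2 matrix B and vector c (neither is from the text) and compares the direct inverse of B + cc' with the expression B^{-1} - B^{-1}cc'B^{-1}/(1 + c'B^{-1}c).

```python
def inv2(m):
    """Inverse of a nonsingular 2x2 matrix given as nested lists."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


# Hypothetical inputs chosen for easy arithmetic.
b = [[2.0, 0.0], [0.0, 4.0]]
c = [1.0, 2.0]

b_inv = inv2(b)
b_inv_c = [sum(b_inv[i][j] * c[j] for j in range(2)) for i in range(2)]  # B^{-1} c
c_b_inv = [sum(c[i] * b_inv[i][j] for i in range(2)) for j in range(2)]  # c' B^{-1}
denom = 1.0 + sum(c[i] * b_inv_c[i] for i in range(2))                   # 1 + c'B^{-1}c

# Right-hand side: B^{-1} - B^{-1} c c' B^{-1} / (1 + c' B^{-1} c)
formula = [[b_inv[i][j] - b_inv_c[i] * c_b_inv[j] / denom for j in range(2)]
           for i in range(2)]

# Left-hand side: direct inverse of B + cc'
direct = inv2([[b[i][j] + c[i] * c[j] for j in range(2)] for i in range(2)])

print(formula)
print(direct)
```

Both computations give the same matrix, which is what multiplying out the product in 2.32 establishes in general.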
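2.33 $|c\mathbf{A}| = |c\mathbf{I}\mathbf{A}|$

$= |c\mathbf{I}||\mathbf{A}|$ [by (2.89)]

$= c^n|\mathbf{A}|$ [by (2.84)]

2.34 $\mathbf{A}\mathbf{A}^{-1} = \mathbf{I}$

$|\mathbf{A}\mathbf{A}^{-1}| = |\mathbf{I}|$

$|\mathbf{A}||\mathbf{A}^{-1}| = 1$ [by (2.83)]

$|\mathbf{A}^{-1}| = \dfrac{1}{|\mathbf{A}|}$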
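
Both determinant results (2.33 and 2.34) are easy to confirm on a small case. The sketch below uses a hypothetical 2 x 2 matrix and scalar, so n = 2; the helper names are illustrative only.

```python
def det2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def inv2(m):
    """Inverse of a nonsingular 2x2 matrix given as nested lists."""
    d = det2(m)
    return [[m[1][1] / d, -m[0][1] / d],
            [-m[1][0] / d, m[0][0] / d]]


a = [[3.0, 1.0], [2.0, 4.0]]   # hypothetical matrix with |A| = 10
c = 5.0                        # hypothetical scalar
n = 2                          # A is n x n

scaled = [[c * x for x in row] for row in a]

det_a = det2(a)
det_scaled = det2(scaled)      # should equal c**n * |A|
det_inverse = det2(inv2(a))    # should equal 1 / |A|

print(det_a, det_scaled, det_inverse)
```

Scaling every element by c multiplies the determinant by c once per row, which is why the exponent is n rather than 1.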

2.35 In (2.93) and (2.94), let $\mathbf{A}_{11} = \mathbf{B}$, $\mathbf{A}_{12} = \mathbf{c}$, $\mathbf{A}_{21} = -\mathbf{c}'$, and $\mathbf{A}_{22} = 1$. Then
equate the right-hand sides of (2.93) and (2.94) to obtain (2.95).

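2.36 By (2.52), $\mathrm{tr}(\mathbf{A}\mathbf{A}') = \sum_{i=1}^{n} \mathbf{a}_i'\mathbf{a}_i = \sum_{i=1}^{n} (a_{i1}^2 + a_{i2}^2 + \cdots + a_{ip}^2)$

$= \sum_{i=1}^{n}\sum_{j=1}^{p} a_{ij}^2$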

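2.37 Show that $|\mathbf{C}| \neq 0$ by taking the determinant of both sides of $\mathbf{C}'\mathbf{C} = \mathbf{I}$. Thus
C is nonsingular and $\mathbf{C}^{-1}$ exists. Multiply $\mathbf{C}'\mathbf{C} = \mathbf{I}$ on the right by $\mathbf{C}^{-1}$ and
on the left by C.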

2.38 Multiply $\mathbf{A}\mathbf{B}\mathbf{x} = \lambda\mathbf{x}$ on the left by B. Then λ is an eigenvalue of BA, and Bx
is an eigenvector.
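
A small numerical check makes the argument in 2.38 concrete. The matrices below are hypothetical 2 x 2 examples, and `eigenvalues2` is an illustrative helper that solves the characteristic polynomial directly (it assumes real eigenvalues, which holds for this choice).

```python
import math


def matmul2(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def matvec(m, v):
    """Product of a 2x2 matrix and a length-2 vector."""
    return [sum(m[i][j] * v[j] for j in range(2)) for i in range(2)]


def eigenvalues2(m):
    """Real eigenvalues of a 2x2 matrix from lambda^2 - tr*lambda + det = 0."""
    tr = m[0][0] + m[1][1]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    root = math.sqrt(tr * tr - 4 * det)
    return ((tr - root) / 2, (tr + root) / 2)


a = [[1.0, 2.0], [3.0, 4.0]]   # hypothetical A
b = [[0.0, 1.0], [1.0, 1.0]]   # hypothetical B

ab = matmul2(a, b)
ba = matmul2(b, a)

lam = eigenvalues2(ab)[1]                 # an eigenvalue of AB
x = [ab[0][1], lam - ab[0][0]]            # solves (AB - lam I) x = 0
bx = matvec(b, x)                         # candidate eigenvector of BA

lhs = matvec(ba, bx)                      # (BA)(Bx)
rhs = [lam * v for v in bx]               # lam (Bx)

print(eigenvalues2(ab), eigenvalues2(ba))
print(lhs, rhs)
```

AB and BA are different matrices here, yet they share both eigenvalues, and Bx satisfies the eigenvector equation for BA.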

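2.39 (a) $(\mathbf{A}^{1/2})^2 = (\mathbf{C}\mathbf{D}^{1/2}\mathbf{C}')^2 = \mathbf{C}\mathbf{D}^{1/2}\mathbf{C}'\mathbf{C}\mathbf{D}^{1/2}\mathbf{C}'$

$= \mathbf{C}\mathbf{D}\mathbf{C}'$ [by (2.101)]

$= \mathbf{A}$ [by (2.109)]

(b) By (2.114), $\mathbf{A}^{1/2}\mathbf{A}^{1/2} = \mathbf{A}$. By (2.89),

$|\mathbf{A}^{1/2}\mathbf{A}^{1/2}| = |\mathbf{A}|$

$|\mathbf{A}^{1/2}||\mathbf{A}^{1/2}| = |\mathbf{A}|$

$|\mathbf{A}^{1/2}|^2 = |\mathbf{A}|$

(c) Since A is positive definite, we have, from part (b), $|\mathbf{A}^{1/2}| = |\mathbf{A}|^{1/2}$.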

CHAPTER  3 

3.1 $\bar{z} = \sum_{i=1}^{n} z_i/n = \sum_i ay_i/n = (ay_1 + \cdots + ay_n)/n$. Now factor a out of the
sum.

3.2 The numerator of $s_z^2$ is $\sum_{i=1}^{n}(z_i - \bar{z})^2 = \sum_i (ay_i - a\bar{y})^2 = \sum_i [a(y_i - \bar{y})]^2$.

3.3 $\bar{x} = 4$, $\bar{y} = 4$:

  x    y    x - x̄    y - ȳ    (x - x̄)(y - ȳ)
  2    2     -2       -2            4
  2    4     -2        0            0
  2    6     -2        2           -4
  4    2      0       -2            0
  4    4      0        0            0
  4    6      0        2            0
  6    2      2       -2           -4
  6    4      2        0            0
  6    6      2        2            4
                              Sum = 0
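
The arithmetic in 3.3 can be reproduced in a few lines. This sketch uses the nine (x, y) pairs from the table above; only the variable names are introduced here.

```python
# The nine (x, y) pairs tabulated in 3.3.
x = [2, 2, 2, 4, 4, 4, 6, 6, 6]
y = [2, 4, 6, 2, 4, 6, 2, 4, 6]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Cross-products of deviations; their sum is the numerator of the covariance.
products = [(xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)]
cross_product_sum = sum(products)

print(x_bar, y_bar)
print(products)
print(cross_product_sum)
```

Because the positive and negative cross-products cancel exactly, the sample covariance (and hence the correlation) of x and y is zero.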


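3.4 $\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} - \bar{x}\begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} - \begin{pmatrix} \bar{x} \\ \bar{x} \\ \vdots \\ \bar{x} \end{pmatrix} = \begin{pmatrix} x_1 - \bar{x} \\ x_2 - \bar{x} \\ \vdots \\ x_n - \bar{x} \end{pmatrix}$

3.5 With $\mathbf{y}_i - \bar{\mathbf{y}} = (y_{i1} - \bar{y}_1,\; y_{i2} - \bar{y}_2,\; y_{i3} - \bar{y}_3)'$,

$\sum_{i=1}^{n} (\mathbf{y}_i - \bar{\mathbf{y}})(\mathbf{y}_i - \bar{\mathbf{y}})' = \sum_{i=1}^{n} \begin{pmatrix} (y_{i1} - \bar{y}_1)^2 & (y_{i1} - \bar{y}_1)(y_{i2} - \bar{y}_2) & (y_{i1} - \bar{y}_1)(y_{i3} - \bar{y}_3) \\ (y_{i2} - \bar{y}_2)(y_{i1} - \bar{y}_1) & (y_{i2} - \bar{y}_2)^2 & (y_{i2} - \bar{y}_2)(y_{i3} - \bar{y}_3) \\ (y_{i3} - \bar{y}_3)(y_{i1} - \bar{y}_1) & (y_{i3} - \bar{y}_3)(y_{i2} - \bar{y}_2) & (y_{i3} - \bar{y}_3)^2 \end{pmatrix}$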

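3.6 $\bar{z} = \sum_{i=1}^{n} z_i/n = \sum_i \mathbf{a}'\mathbf{y}_i/n = (\mathbf{a}'\mathbf{y}_1 + \cdots + \mathbf{a}'\mathbf{y}_n)/n$. Now factor out $\mathbf{a}'$ on
the left. See also (2.42).

3.7 The numerator of $s_z^2$ is $\sum_{i=1}^{n}(z_i - \bar{z})^2 = \sum_i (\mathbf{a}'\mathbf{y}_i - \mathbf{a}'\bar{\mathbf{y}})^2 = \sum_i (\mathbf{a}'\mathbf{y}_i - \mathbf{a}'\bar{\mathbf{y}})(\mathbf{a}'\mathbf{y}_i - \mathbf{a}'\bar{\mathbf{y}})$. The scalar $\mathbf{a}'\mathbf{y}_i$ is equal to its transpose, as in (2.39). Thus
$\mathbf{a}'\mathbf{y}_i = (\mathbf{a}'\mathbf{y}_i)' = \mathbf{y}_i'\mathbf{a}$, and $\sum_i (\mathbf{a}'\mathbf{y}_i - \mathbf{a}'\bar{\mathbf{y}})(\mathbf{a}'\mathbf{y}_i - \mathbf{a}'\bar{\mathbf{y}}) = \sum_i (\mathbf{a}'\mathbf{y}_i - \mathbf{a}'\bar{\mathbf{y}})(\mathbf{y}_i'\mathbf{a} - \bar{\mathbf{y}}'\mathbf{a})$. By (2.22) and (2.24), this becomes $\sum_i \mathbf{a}'(\mathbf{y}_i - \bar{\mathbf{y}})(\mathbf{y}_i - \bar{\mathbf{y}})'\mathbf{a}$. Now factor
out $\mathbf{a}'$ on the left and a on the right. See also (2.44).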

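3.8 By (3.63) and (3.64),

$\mathbf{A}\mathbf{S}\mathbf{A}' = \begin{pmatrix} \mathbf{a}_1'\mathbf{S}\mathbf{a}_1 & \mathbf{a}_1'\mathbf{S}\mathbf{a}_2 & \cdots & \mathbf{a}_1'\mathbf{S}\mathbf{a}_k \\ \mathbf{a}_2'\mathbf{S}\mathbf{a}_1 & \mathbf{a}_2'\mathbf{S}\mathbf{a}_2 & \cdots & \mathbf{a}_2'\mathbf{S}\mathbf{a}_k \\ \vdots & \vdots & & \vdots \\ \mathbf{a}_k'\mathbf{S}\mathbf{a}_1 & \mathbf{a}_k'\mathbf{S}\mathbf{a}_2 & \cdots & \mathbf{a}_k'\mathbf{S}\mathbf{a}_k \end{pmatrix}$,

from which the result follows immediately.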
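3.9 $\mathrm{cov}(\mathbf{z}) = \mathrm{cov}[(\boldsymbol{\Sigma}^{1/2})^{-1}\bar{\mathbf{y}} - (\boldsymbol{\Sigma}^{1/2})^{-1}\boldsymbol{\mu}]$

$= (\boldsymbol{\Sigma}^{1/2})^{-1}\,\mathrm{cov}(\bar{\mathbf{y}})\,[(\boldsymbol{\Sigma}^{1/2})^{-1}]'$ [by (3.76)]

$= (\boldsymbol{\Sigma}^{1/2})^{-1}\left(\dfrac{\boldsymbol{\Sigma}}{n}\right)(\boldsymbol{\Sigma}^{1/2})^{-1}$

$= \dfrac{1}{n}(\boldsymbol{\Sigma}^{1/2})^{-1}\boldsymbol{\Sigma}^{1/2}\boldsymbol{\Sigma}^{1/2}(\boldsymbol{\Sigma}^{1/2})^{-1}$ [by (2.114)]

$= \dfrac{1}{n}\mathbf{I}$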

3.10  Answers  are  given  in  Examples  3.6  and  3.7. 

3.11  (a)  |S|  =  459.956  (b)  tr(S)  =  213.043 


3.12 (a) |S| = 27,236,586 (b) tr(S) = 292.891


3.13 $\mathbf{R} = \begin{pmatrix} 1.000 & .614 & .757 & .575 & .413 \\ .614 & 1.000 & .547 & .750 & .548 \\ .757 & .547 & 1.000 & .605 & .692 \\ .575 & .750 & .605 & 1.000 & .524 \\ .413 & .548 & .692 & .524 & 1.000 \end{pmatrix}$

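3.14 $\bar{z} = 83.298$, $s_z^2 = 1048.659$

3.15 $r_{zw} = -.6106$

3.16 $y_1 = (1, 0, 0)\mathbf{y} = \mathbf{a}'\mathbf{y}$, $\tfrac{1}{2}(y_2 + y_3) = (0, \tfrac{1}{2}, \tfrac{1}{2})\mathbf{y} = \mathbf{b}'\mathbf{y}$. Use (3.57) to obtain $r_{zw} = .4873$.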


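3.17 (a) $\bar{\mathbf{z}} = \begin{pmatrix} 38.369 \\ 40.838 \\ -51.727 \end{pmatrix}$, $\mathbf{S}_z = \begin{pmatrix} 323.64 & 19.25 & -460.98 \\ 19.25 & 588.67 & 104.07 \\ -460.98 & 104.07 & 686.27 \end{pmatrix}$

(b) $\mathbf{R}_z = \begin{pmatrix} 1.0000 & .0441 & -.9781 \\ .0441 & 1.0000 & .1637 \\ -.9781 & .1637 & 1.0000 \end{pmatrix}$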

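3.18 (a) $\bar{\mathbf{y}} = \begin{pmatrix} 48.655 \\ 49.625 \\ 50.570 \\ 51.445 \end{pmatrix}$, $\mathbf{S} = \begin{pmatrix} 6.3300 & 6.1891 & 5.7770 & 5.5348 \\ 6.1891 & 6.4493 & 6.1534 & 5.9057 \\ 5.7770 & 6.1534 & 6.9180 & 6.9267 \\ 5.5348 & 5.9057 & 6.9267 & 7.4331 \end{pmatrix}$,

$\mathbf{R} = \begin{pmatrix} 1.0000 & .9687 & .8730 & .8069 \\ .9687 & 1.0000 & .9212 & .8530 \\ .8730 & .9212 & 1.0000 & .9659 \\ .8069 & .8530 & .9659 & 1.0000 \end{pmatrix}$

(b) |S| = 1.0865, tr(S) = 27.1304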


3.19 (a) $\bar{z} = 44.1400$, $s_z^2 = 21.2309$, $\bar{w} = 103.8850$, $s_w^2 = 30.8161$
(b) $s_{zw} = 6.5359$, $r_{zw} = .2555$
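3.20 $\bar{\mathbf{z}} = \begin{pmatrix} 401.40 \\ -47.55 \\ 150.48 \end{pmatrix}$, $\mathbf{S}_z = \begin{pmatrix} 398.33 & -44.35 & 148.35 \\ -44.35 & 12.36 & -16.90 \\ 148.35 & -16.90 & 59.46 \end{pmatrix}$,

$\mathbf{R}_z = \begin{pmatrix} 1.00 & -.63 & .96 \\ -.63 & 1.00 & -.62 \\ .96 & -.62 & 1.00 \end{pmatrix}$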

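3.21 (a) $\begin{pmatrix} \bar{\mathbf{y}} \\ \bar{\mathbf{x}} \end{pmatrix} = \begin{pmatrix} 185.72 \\ 151.12 \\ 183.84 \\ 149.24 \end{pmatrix}$, $\mathbf{S} = \begin{pmatrix} 95.29 & 52.87 & 69.66 & 46.11 \\ 52.87 & 54.36 & 51.31 & 35.05 \\ 69.66 & 51.31 & 100.81 & 56.54 \\ 46.11 & 35.05 & 56.54 & 45.02 \end{pmatrix}$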

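3.22 $\begin{pmatrix} \bar{\mathbf{y}} \\ \bar{\mathbf{x}} \end{pmatrix} = \begin{pmatrix} 70.08 \\ 73.54 \\ 75.10 \\ 109.68 \\ 104.24 \\ 109.98 \end{pmatrix}$,

$\mathbf{S} = \begin{pmatrix} 95.54 & 17.61 & 12.18 & 60.52 & 23.00 & 62.84 \\ 17.61 & 73.19 & 14.25 & 5.73 & 61.28 & -1.66 \\ 12.18 & 14.25 & 76.17 & 46.75 & 32.77 & 69.84 \\ 60.52 & 5.73 & 46.75 & 808.63 & 320.59 & 227.36 \\ 23.00 & 61.28 & 32.77 & 320.59 & 505.86 & 167.35 \\ 62.84 & -1.66 & 69.84 & 227.36 & 167.35 & 508.71 \end{pmatrix}$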

CHAPTER  4 

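4.1 $|\boldsymbol{\Sigma}_1| = 1$, $\mathrm{tr}(\boldsymbol{\Sigma}_1) = 20$, $|\boldsymbol{\Sigma}_2| = 4$, $\mathrm{tr}(\boldsymbol{\Sigma}_2) = 15$. Thus $\mathrm{tr}(\boldsymbol{\Sigma}_1) > \mathrm{tr}(\boldsymbol{\Sigma}_2)$,
but $|\boldsymbol{\Sigma}_1| < |\boldsymbol{\Sigma}_2|$. When converted to correlations, we have

$\mathbf{P}_1 = \begin{pmatrix} 1.00 & .96 & .80 \\ .96 & 1.00 & .89 \\ .80 & .89 & 1.00 \end{pmatrix}$, $\mathbf{P}_2 = \begin{pmatrix} 1.00 & .87 & .41 \\ .87 & 1.00 & .71 \\ .41 & .71 & 1.00 \end{pmatrix}$.

As noted at the end of Section 4.1.3, a decrease in intercorrelations or an
increase in the variances will lead to a larger $|\boldsymbol{\Sigma}|$. In this case, the decrease
in correlations from $\boldsymbol{\Sigma}_1$ to $\boldsymbol{\Sigma}_2$ outweighed the decrease in the variances (the
decrease in trace).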
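4.2 $E(\mathbf{z}) = (\mathbf{T}')^{-1}[E(\mathbf{y}) - \boldsymbol{\mu}]$ [by (3.75)]

$= (\mathbf{T}')^{-1}[\boldsymbol{\mu} - \boldsymbol{\mu}] = \mathbf{0}$,

$\mathrm{cov}(\mathbf{z}) = (\mathbf{T}')^{-1}\boldsymbol{\Sigma}[(\mathbf{T}')^{-1}]'$ [by (3.76)]

$= (\mathbf{T}')^{-1}\mathbf{T}'\mathbf{T}\mathbf{T}^{-1}$ [by (2.75) and (2.79)]

$= \mathbf{I}$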

4.3 By the last expression in Section 2.3.1,

$\prod_{i=1}^{n} \dfrac{1}{(\sqrt{2\pi})^{p}|\boldsymbol{\Sigma}|^{1/2}} = \dfrac{1}{(\sqrt{2\pi})^{np}|\boldsymbol{\Sigma}|^{n/2}}$.

The sum in the exponent of (4.13) follows from the basic algebra of exponents.

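4.4 Since $(\mathbf{y} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{y} - \boldsymbol{\mu})$ is a scalar, we have $E[(\mathbf{y} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{y} - \boldsymbol{\mu})] = E\{\mathrm{tr}[(\mathbf{y} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{y} - \boldsymbol{\mu})]\} = E\{\mathrm{tr}[\boldsymbol{\Sigma}^{-1}(\mathbf{y} - \boldsymbol{\mu})(\mathbf{y} - \boldsymbol{\mu})']\} = \mathrm{tr}\{\boldsymbol{\Sigma}^{-1}E[(\mathbf{y} - \boldsymbol{\mu})(\mathbf{y} - \boldsymbol{\mu})']\} = \mathrm{tr}(\boldsymbol{\Sigma}^{-1}\boldsymbol{\Sigma}) = \mathrm{tr}(\mathbf{I}_p) = p$.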

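4.5 The other two terms are of the form $\tfrac{1}{2}\sum_{i=1}^{n}(\bar{\mathbf{y}} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{y}_i - \bar{\mathbf{y}})$, which is
equal to $\tfrac{1}{2}[(\bar{\mathbf{y}} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}]\sum_{i=1}^{n}(\mathbf{y}_i - \bar{\mathbf{y}})$. This vanishes because $\sum_{i=1}^{n}(\mathbf{y}_i - \bar{\mathbf{y}}) = n\bar{\mathbf{y}} - n\bar{\mathbf{y}} = \mathbf{0}$.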

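4.6 We replace $y_i$ in $\sqrt{b_1}$ by $z_i = ay_i + b$. By an extension of (3.3), $\bar{z} = a\bar{y} + b$.
Then (4.18) becomes

$\dfrac{\sqrt{n}\sum_{i=1}^{n}(z_i - \bar{z})^3}{\left[\sum_{i=1}^{n}(z_i - \bar{z})^2\right]^{3/2}} = \dfrac{\sqrt{n}\sum_i (ay_i + b - a\bar{y} - b)^3}{\left[\sum_i (ay_i + b - a\bar{y} - b)^2\right]^{3/2}}$

$= \dfrac{\sqrt{n}\,a^3\sum_i (y_i - \bar{y})^3}{\left[a^2\sum_i (y_i - \bar{y})^2\right]^{3/2}} = \dfrac{\sqrt{n}\sum_i (y_i - \bar{y})^3}{\left[\sum_i (y_i - \bar{y})^2\right]^{3/2}} = \sqrt{b_1}$.

Similarly, if (4.19) is expressed in terms of $z_i = ay_i + b$, it reduces to $b_2$ in
terms of $y_i$.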
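
The invariance in 4.6 can be seen numerically. The sketch below assumes the sample skewness has the form shown in the derivation and takes the kurtosis to be b2 = nΣ(y_i - ȳ)^4 / [Σ(y_i - ȳ)^2]^2, the usual sample form; that expression for (4.19) is an assumption, since the equation itself is not reproduced in this answer. The data and the constants a and b are hypothetical, with a > 0 so that the cancellation of a^3 against (a^2)^{3/2} holds without a sign change.

```python
import math


def skewness_kurtosis(y):
    """Return (sqrt(b1), b2) computed from sums of powers of deviations."""
    n = len(y)
    y_bar = sum(y) / n
    s2 = sum((v - y_bar) ** 2 for v in y)
    s3 = sum((v - y_bar) ** 3 for v in y)
    s4 = sum((v - y_bar) ** 4 for v in y)
    root_b1 = math.sqrt(n) * s3 / s2 ** 1.5
    b2 = n * s4 / s2 ** 2
    return root_b1, b2


y = [1.0, 2.0, 4.0, 7.0, 11.0]      # hypothetical data
a, b = 3.0, -5.0                     # hypothetical constants, a > 0
z = [a * v + b for v in y]           # z_i = a*y_i + b

skew_y, kurt_y = skewness_kurtosis(y)
skew_z, kurt_z = skewness_kurtosis(z)

print(skew_y, skew_z)
print(kurt_y, kurt_z)
```

With a negative multiplier the kurtosis would still be unchanged, but the skewness would change sign, because a^3 / |a|^3 = -1 in that case.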

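4.7 $\beta_{2,p} = E[(\mathbf{y} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{y} - \boldsymbol{\mu})]^2$ by (4.33). But when y is $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, $v = (\mathbf{y} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{y} - \boldsymbol{\mu})$ is distributed as $\chi^2(p)$ by property 3 in Section 4.2. Then
$E(v^2) = \mathrm{var}(v) + [E(v)]^2$.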

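4.8 To show that $b_{1,p}$ and $b_{2,p}$ are invariant under the transformation $\mathbf{z}_i = \mathbf{A}\mathbf{y}_i + \mathbf{b}$,
where A is nonsingular, it is sufficient to show that $g_{ij}(\mathbf{z}) = (\mathbf{y}_i - \bar{\mathbf{y}})'\hat{\boldsymbol{\Sigma}}^{-1}(\mathbf{y}_j - \bar{\mathbf{y}})$. By (3.67) and (3.68), $\bar{\mathbf{z}} = \mathbf{A}\bar{\mathbf{y}} + \mathbf{b}$ and $\hat{\boldsymbol{\Sigma}}_z = \mathbf{A}\hat{\boldsymbol{\Sigma}}\mathbf{A}'$. Then $g_{ij}$ for z becomes

$g_{ij}(\mathbf{z}) = (\mathbf{z}_i - \bar{\mathbf{z}})'\hat{\boldsymbol{\Sigma}}_z^{-1}(\mathbf{z}_j - \bar{\mathbf{z}})$

$= (\mathbf{A}\mathbf{y}_i + \mathbf{b} - \mathbf{A}\bar{\mathbf{y}} - \mathbf{b})'(\mathbf{A}\hat{\boldsymbol{\Sigma}}\mathbf{A}')^{-1}(\mathbf{A}\mathbf{y}_j + \mathbf{b} - \mathbf{A}\bar{\mathbf{y}} - \mathbf{b})$

$= (\mathbf{y}_i - \bar{\mathbf{y}})'\mathbf{A}'(\mathbf{A}')^{-1}\hat{\boldsymbol{\Sigma}}^{-1}\mathbf{A}^{-1}\mathbf{A}(\mathbf{y}_j - \bar{\mathbf{y}})$

$= (\mathbf{y}_i - \bar{\mathbf{y}})'\hat{\boldsymbol{\Sigma}}^{-1}(\mathbf{y}_j - \bar{\mathbf{y}}) = g_{ij}(\mathbf{y})$.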
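
The invariance argued in 4.8 can be illustrated with a small bivariate data set. In this sketch the data, the nonsingular matrix A, and the vector b are all hypothetical, and the covariance estimate uses divisor n; the choice of divisor does not affect the invariance. The helper names are illustrative.

```python
def mean_vector(rows):
    """Column means of a list of 2-element observations."""
    n = len(rows)
    return [sum(r[k] for r in rows) / n for k in range(2)]


def covariance(rows):
    """2x2 covariance matrix with divisor n."""
    n = len(rows)
    m = mean_vector(rows)
    return [[sum((r[i] - m[i]) * (r[j] - m[j]) for r in rows) / n
             for j in range(2)] for i in range(2)]


def inv2(m):
    """Inverse of a nonsingular 2x2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


def g(rows, i, j):
    """g_ij = (y_i - ybar)' Sigma_hat^{-1} (y_j - ybar)."""
    m = mean_vector(rows)
    s_inv = inv2(covariance(rows))
    di = [rows[i][k] - m[k] for k in range(2)]
    dj = [rows[j][k] - m[k] for k in range(2)]
    return sum(di[r] * s_inv[r][c] * dj[c] for r in range(2) for c in range(2))


y = [[1.0, 2.0], [2.0, 1.0], [4.0, 5.0], [7.0, 3.0], [6.0, 8.0]]
a = [[2.0, 1.0], [0.0, 3.0]]     # hypothetical nonsingular A
b = [5.0, -4.0]                  # hypothetical shift

# z_i = A y_i + b for every observation.
z = [[sum(a[r][c] * obs[c] for c in range(2)) + b[r] for r in range(2)]
     for obs in y]

pairs = [(0, 0), (0, 3), (2, 4)]
g_y = [g(y, i, j) for i, j in pairs]
g_z = [g(z, i, j) for i, j in pairs]
print(g_y)
print(g_z)
```

Since the multivariate skewness and kurtosis measures are built entirely from the g_ij, equality of g_ij before and after the transformation is what carries the invariance over to them.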
4.9 Let $i = (n)$ in (4.44); then solve for $D_{(n)}^2$ in (4.43) and substitute into (4.44) to
obtain $F_{(n)}$ in terms of w, as in (4.45).

4.10 (a) $\mathbf{a}' = (2, -1, 3)$, $z = \mathbf{a}'\mathbf{y}$ is $N(17, 21)$


(b)  A  = 


1  1 
1  -1 


z  =  Ay  is  N2 


10 


(c)  By  property  4b  in  Section  4.2,  y2  is  N(  1,  13). 


(d)  By  property  4a  in  Section  4.2, 


y  1 

,V3 


is  N2 


(e)  A  = 


4.11  (a)  z  = 


(b)  z  = 


1  0 

0  0 

1  j_ 

2  2 

.408 

-.047 

.285 

.465 

-.070 

.170 


-.070  .170  \  /  y  —  3  \ 

.326  -.166  y-l 
-.166  .692  )\y-  4  / 


-1 

9 


1  5.25  / 


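(c) By (4.6), $(\mathbf{y} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{y} - \boldsymbol{\mu})$ is distributed as $\chi^2_3$.
4.12 (a) $\mathbf{a}' = (4, -2, 1, -3)$, $z = \mathbf{a}'\mathbf{y}$ is $N(-30, 153)$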

(b)  A  =  7 


(c)  A  = 


z  =  Ay  is  A3 


(d) By property 4b in Section 4.2, $y_3$ is $N(-1, 2)$.


(e)  By  property  4a  in  Section  4.2, 


yi 

y4 


is  N2 


27  -79  \ 
-79  361  ) 


(f)  A  = 


/  1  0  0  0  \ 
0  0 


1  1 

2  2 
1  i 

3  3 


0 


1111 


Ay  is  N4 


(  -2\ 

/ 

.5 

0 

’ 

l  1-25  ) 

V 

1.5 

2 


1.5 

2 

3.75 

\ 

1 

.67 

.875 

.67 

.67 

1 

.875 

1 

1.688 

) 
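4.13 (a) $\mathbf{z} = \begin{pmatrix} .302 & 0 & 0 & 0 \\ .408 & .561 & 0 & 0 \\ -.087 & .261 & 1.015 & 0 \\ -.858 & -.343 & -.686 & .972 \end{pmatrix}\begin{pmatrix} y_1 + 2 \\ y_2 - 3 \\ y_3 + 1 \\ y_4 - 5 \end{pmatrix}$

(b) $\mathbf{z} = \begin{pmatrix} .810 & .305 & .143 & -.479 \\ .305 & .582 & .249 & -.083 \\ .143 & .249 & 1.153 & -.298 \\ -.480 & -.083 & -.298 & .787 \end{pmatrix}\begin{pmatrix} y_1 + 2 \\ y_2 - 3 \\ y_3 + 1 \\ y_4 - 5 \end{pmatrix}$

(c) $(\mathbf{y} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{y} - \boldsymbol{\mu}) = (\mathbf{y} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1/2}\boldsymbol{\Sigma}^{-1/2}(\mathbf{y} - \boldsymbol{\mu}) = \mathbf{z}'\mathbf{z}$, which is
$\chi^2(p) = \chi^2(4)$.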

4.14  The  variables  in  (b),  (c),  and  (d)  are  independent. 

4.15  The  variables  in  (a),  (c),  (d),  (f),  (i),  (j),  and  (n)  are  independent. 

4.16  (a)  £(y|x)  =  fiy  +  %yx. 


M-t) 


-.5 

.5 


5 

-2 


-2 

4 


-l 


xi 

X2  ‘ 


.25 

1.25 


Xi  - 
X2  ■ 


.25 

1.25 


Xi 

X2 


(X  -  fix) 


(b)  cov(y|x)  =  Xyy 
14  -I 
4.18  (a) By the central limit theorem in Section 4.3.2, $\sqrt{n}(\bar{\mathbf{y}} - \boldsymbol{\mu})$ is approximately
$N_p(\mathbf{0}, \boldsymbol{\Sigma})$.

(b) $\bar{\mathbf{y}}$ is approximately $N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}/n)$.

4.19  (a)  The  plots  show  almost  no  deviation  from  normality. 


(b)  Variable          $y_1$     $y_2$     $y_3$     $y_4$
     $\sqrt{b_1}$      .3069     .3111     .0645     .0637
     $b_2$             1.932     2.107     1.792     1.570

The values of $\sqrt{b_1}$ show a small amount of positive skewness, but none
exceeds the upper 2.5% critical value for $\sqrt{b_1}$ given in Table A.1 as .942.
The values of $b_2$ show negative kurtosis. For $y_4$, the kurtosis is significant,
since $b_2 < 1.74$, the lower 2.5 percentile in Table A.3.

(c)  Variable          $y_1$     $y_2$     $y_3$     $y_4$
     $D$               .2848     .2841     .2866     .2851
     $Y$               .4021     .2934     .6730     .4491

From Table A.4, the lower 2.5 percentile for $Y$ is $-3.04$ and the upper
97.5 percentile is .628. We reject the hypothesis of normality only for $y_3$.

(d) $z$ defined in (4.24) is approximately $N(0, 3/n)$. To obtain a $N(0, 1)$ statistic,
we calculate $z^* = z/\sqrt{3/n}$.

     Variable          $y_1$     $y_2$     $y_3$     $y_4$
     $z^*$            -.3366    -.3095    -.0737    -.0856

4.20  (a)  $i$         1     2     3     4     5     6     7     8     9     10
           $D_i^2$    1.06  1.60  7.54  3.54  4.61   .63   .81  2.47   .95  3.78

(b) The .05 critical value from Table A.6 is 7.01. $D^2_{(10)} = 7.54 > 7.01$.

(c)  $i$              1     2     3     4     5     6     7     8     9     10
     $u_{(i)}$       .08   .10   .12   .13   .20   .30   .44   .47   .57   .93
     $\upsilon_i$    .07   .13   .18   .23   .28   .34   .40   .47   .55   .68

The plot of $(\upsilon_i, u_{(i)})$ shows some evidence of nonlinearity and an outlier.

(d) $b_{1,p} = 7.255$, $b_{2,p} = 14.406$. Both (barely) exceed upper .05 critical
values in Table A.5.
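
The ordered values $u_{(i)}$ in 4.20(c) can be recovered from the $D_i^2$ in part (a). The sketch below assumes the scaling $u_i = nD_i^2/(n-1)^2$ with $n = 10$; that formula is not restated in the answer itself, but it reproduces the tabulated row to two decimals.

```python
# Squared distances D_i^2 as listed in the answer to 4.20(a).
d2 = [1.06, 1.60, 7.54, 3.54, 4.61, 0.63, 0.81, 2.47, 0.95, 3.78]
n = len(d2)

# Assumed scaling: u_i = n * D_i^2 / (n - 1)^2, then sort to get u_(1) <= ... <= u_(n).
u_sorted = sorted(n * d / (n - 1) ** 2 for d in d2)
u_rounded = [round(u, 2) for u in u_sorted]
print(u_rounded)

# Largest squared distance, compared in part (b) with the tabled critical value 7.01.
largest_d2 = max(d2)
print(largest_d2 > 7.01)
```

Plotting these $u_{(i)}$ against the corresponding beta quantiles $\upsilon_i$ gives the Q-Q plot described in part (c).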


4.21  (b)  Variable          $y_1$     $y_2$     $y_3$     $y_4$     $y_5$
           $\sqrt{b_1}$      .2176     .5857     .7461    -.3327    -.1772
           $b_2$             2.079     1.681     2.583     1.774     2.456

None of the values of $\sqrt{b_1}$ exceeds 1.134 (from Table A.1) or is less
than $-1.134$. None of the values of $b_2$ is less than 1.53 (from Table A.3).
Thus there is no significant departure from normality.

(c)  Variable          $y_1$     $y_2$     $y_3$     $y_4$     $y_5$
     $D$               .279      .269      .275      .281      .276
     $Y$              -.305    -1.399     -.805     -.114     -.669

(d) $z^* = z/\sqrt{3/n}$, where $z$ is defined in (4.24).

     Variable          $y_1$     $y_2$     $y_3$     $y_4$     $y_5$
     $z^*$            -.4848   -1.7183   -1.3627     .8091     .3686

4.22  (a)  $i$         1     2     3     4     5     6     7     8     9     10    11
           $D_i^2$    5.20  2.15  7.63  5.34  5.54  1.73  5.21  5.90  2.72  6.02  2.56

(c)  $i$              1     2     3     4     5     6     7     8     9     10    11
     $u_{(i)}$       .19   .24   .28   .30   .57   .57   .59   .61   .65   .66   .84
     $\upsilon_i$    .18   .27   .34   .39   .45   .50   .55   .61   .66   .73   .82

The plot shows a sharp break from the fourth to the fifth points.

(d) $b_{1,p} = 12.985$, $b_{2,p} = 29.072$

4.23  (a) The Q-Q plots for $y_1$ and $y_5$ show little departure from normality. The
Q-Q plots for $y_2$ and $y_3$ show some evidence of heavier tails than the normal.
The Q-Q plots for $y_4$ and $y_6$ show some evidence of positive skewness.

(b)  Variable          $y_1$     $y_2$     $y_3$     $y_4$     $y_5$     $y_6$
     $\sqrt{b_1}$      .5521     .0302     .7827    1.4627     .2219     .9974
     $b_2$             3.160     3.275     2.772     6.675     2.176     4.528

(c)  Variable          $y_1$     $y_2$     $y_3$     $y_4$     $y_5$     $y_6$
     $D$               .276      .274      .275      .260      .286      .271
     $Y$             -1.469    -1.845    -1.675    -5.249      .889    -2.741

(d)  Variable          $y_1$     $y_2$     $y_3$     $y_4$     $y_5$     $y_6$
     $z^*$           -1.640     -.062    -2.803    -2.961     -.870    -2.456

4.24  (a) $D_i^2 = 7.816, 3.640, 5.730, \ldots, 6.433$

(b) $D^2_{(51)} = 25.628$. By extrapolation in Table A.6, the .05 critical value for
$p = 6$ is approximately 19. Thus we reject the hypothesis of multivariate
normality.

(c) $(\upsilon_i, u_{(i)}) = (.021, .024), (.029, .028), \ldots, (.306, .523)$. The plot shows
nonlinearity for the last 4 points.

(d) $b_{1,p} = 16.287$, $b_{2,p} = 58.337$. By extrapolation to $p = 6$ in Table A.5,
both appear to exceed their critical values.
CHAPTER  5


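5.1  By (5.6), we have

$$T^2 = \sqrt{n}(\bar{\mathbf{y}} - \boldsymbol{\mu}_0)'\mathbf{S}^{-1}\sqrt{n}(\bar{\mathbf{y}} - \boldsymbol{\mu}_0) = n(\bar{\mathbf{y}} - \boldsymbol{\mu}_0)'\mathbf{S}^{-1}(\bar{\mathbf{y}} - \boldsymbol{\mu}_0).$$

5.2  From (5.9), we have

$$T^2 = \frac{n_1 n_2}{n_1 + n_2}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)'\mathbf{S}_{pl}^{-1}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2) = (\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)'\left[\left(\frac{n_1 + n_2}{n_1 n_2}\right)\mathbf{S}_{pl}\right]^{-1}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)$$

$$= (\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)'\left[\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\mathbf{S}_{pl}\right]^{-1}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2).$$

5.3  By (5.13) and (5.14),

$$t^2(\mathbf{a}) = \frac{[\mathbf{a}'(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)]^2}{[(n_1 + n_2)/n_1 n_2]\,\mathbf{a}'\mathbf{S}_{pl}\mathbf{a}} = \frac{n_1 n_2}{n_1 + n_2}\,\frac{[(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)'\mathbf{S}_{pl}^{-1}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)]^2}{(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)'\mathbf{S}_{pl}^{-1}\mathbf{S}_{pl}\mathbf{S}_{pl}^{-1}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)}.$$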


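5.4  It is assumed that $y$ and $x$ have a bivariate normal distribution. Let $\mathbf{y}_i = (y_i, x_i)'$.
Then $d_i$ can be expressed as $d_i = y_i - x_i = \mathbf{a}'\mathbf{y}_i$, where $\mathbf{a}' = (1, -1)$. By
property 1a in Section 4.2, $d_i$ is $N(\mathbf{a}'\boldsymbol{\mu}, \mathbf{a}'\boldsymbol{\Sigma}\mathbf{a})$. Show that $\mathbf{a}'\bar{\mathbf{y}} = \bar{y} - \bar{x}$,
$\mathbf{a}'\mathbf{S}\mathbf{a} = s_y^2 - 2s_{yx} + s_x^2 = s_d^2$, and that $T^2 = n(\mathbf{a}'\bar{\mathbf{y}})'(\mathbf{a}'\mathbf{S}\mathbf{a})^{-1}(\mathbf{a}'\bar{\mathbf{y}})$ is the square of
$t = \bar{d}/(s_d/\sqrt{n})$.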

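5.5  $\bar{d} = \frac{1}{n}\sum_{i=1}^n d_i = \frac{1}{n}\sum_{i=1}^n (y_i - x_i) = \frac{1}{n}\sum_i y_i - \frac{1}{n}\sum_i x_i = \bar{y} - \bar{x}$,

$s_d^2 = \frac{1}{n-1}\sum_{i=1}^n (d_i - \bar{d})^2 = \frac{1}{n-1}\sum_i (y_i - x_i - \bar{y} + \bar{x})^2$

$= \frac{1}{n-1}\sum_i [(y_i - \bar{y}) - (x_i - \bar{x})]^2$

When this is expanded, we obtain $s_d^2 = s_y^2 + s_x^2 - 2s_{yx}$.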
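
The identities in 5.4 and 5.5 are easy to confirm on a small example. The paired values below are hypothetical and chosen only for illustration; the sketch checks that $\bar{d} = \bar{y} - \bar{x}$, that $s_d^2 = s_y^2 + s_x^2 - 2s_{yx}$, and that the univariate $t^2$ equals $n(\mathbf{a}'\bar{\mathbf{y}})^2/\mathbf{a}'\mathbf{S}\mathbf{a}$ with $\mathbf{a}' = (1, -1)$.

```python
# Hypothetical paired observations (illustrative values, not from the text).
y = [12.0, 15.5, 9.8, 14.1, 11.3]
x = [10.4, 13.9, 10.1, 12.0, 9.7]
n = len(y)


def mean(values):
    return sum(values) / len(values)


def cov(u, v):
    """Unbiased sample covariance; cov(u, u) is the sample variance."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)


d = [yi - xi for yi, xi in zip(y, x)]

# Left side: variance of the differences. Right side: a'Sa with a' = (1, -1).
s_d2 = cov(d, d)
a_s_a = cov(y, y) + cov(x, x) - 2 * cov(y, x)

# Paired t statistic squared versus the T^2 form n (a'ybar)^2 / (a'Sa).
t_squared = (mean(d) / (s_d2 ** 0.5 / n ** 0.5)) ** 2
T_squared = n * (mean(y) - mean(x)) ** 2 / a_s_a
print(s_d2, a_s_a, t_squared, T_squared)
```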

5.6  The  solution  is  similar  to  that  for  Problem  5.1. 

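5.7  By (5.7), $[(\nu - p + 1)/\nu p]\,T^2_{p,\nu} = F_{p,\nu-p+1}$. By (5.29), $(\nu - p)(T^2_{p+q} - T^2_p)/(\nu + T^2_p)$
is $T^2_{q,\nu-p}$. Replacing $p$ by $q$ and $\nu$ by $\nu - p$ in (5.7), we see that

$$\frac{(\nu - p) - q + 1}{(\nu - p)q}\,(\nu - p)\,\frac{T^2_{p+q} - T^2_p}{\nu + T^2_p} \quad \text{is} \quad F_{q,(\nu-p)-q+1}.$$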

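5.9  Under $H_{03}$, we have $\mathbf{C}\boldsymbol{\mu}_1 = \mathbf{0}$ and $\mathbf{C}\boldsymbol{\mu}_2 = \mathbf{0}$. Then

$$E(\mathbf{C}\bar{\mathbf{y}}) = \mathbf{C}E(\bar{\mathbf{y}}) = \mathbf{C}E\left(\frac{n_1\bar{\mathbf{y}}_1 + n_2\bar{\mathbf{y}}_2}{n_1 + n_2}\right) = \frac{n_1\mathbf{C}\boldsymbol{\mu}_1 + n_2\mathbf{C}\boldsymbol{\mu}_2}{n_1 + n_2} = \mathbf{0}.$$

Since $\bar{\mathbf{y}}_1$ and $\bar{\mathbf{y}}_2$ are independent,

$$\text{cov}(\bar{\mathbf{y}}) = \frac{n_1^2\boldsymbol{\Sigma}/n_1 + n_2^2\boldsymbol{\Sigma}/n_2}{(n_1 + n_2)^2} = \frac{\boldsymbol{\Sigma}}{n_1 + n_2}.$$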


5.10  $\mathbf{C}\mathbf{S}_{pl}\mathbf{C}'/(n_1 + n_2)$ is the sample covariance matrix of $\mathbf{C}\bar{\mathbf{y}}$. Hence the equation
immediately above (5.39) exhibits the characteristic form of the $T^2$-distribution.

5.11  $T^2 = .061$

5.12  (a) $T^2 = 85.3327$

(b) $t_1 = 2.5039$, $t_2 = .2665$, $t_3 = -2.5157$, $t_4 = .9510$, $t_5 = .3161$

5.13  $T^2 = 30.2860$

5.14  (a) $T^2 = 1.8198$

(b) $t_1 = 1.1643$, $t_2 = 1.1006$, $t_3 = .9692$, $t_4 = .7299$. None of these is
significant. In fact, ordinarily they would not have been examined because
the $T^2$-test in part (a) did not reject $H_0$.

5.15  $T^2 = 79.5510$

5.16  (a) $T^2 = 133.4873$

(b) $t_1 = 3.8879$, $t_2 = -3.8652$, $t_3 = -5.6911$, $t_4 = -5.0426$

(c) $\mathbf{a}' = (.345, -.130, -.106, -.143)$

(d) $T^2 = 133.4873$

(e) $R^2 = .782975$, $T^2 = 133.4873$

(f) By (5.32), $t^2(y_1|y_2, y_3, y_4) = 35.9336$, $t^2(y_2|y_1, y_3, y_4) = 5.7994$,
$t^2(y_3|y_1, y_2, y_4) = 1.7749$, $t^2(y_4|y_1, y_2, y_3) = 8.2592$

(g) By (5.29), $T^2(y_3, y_4|y_1, y_2) = 12.5206$, $F(y_3, y_4|y_1, y_2) = 6.0814$
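
The figures in 5.16(e) and 5.16(g) are linked by simple conversions, sketched below. Two assumptions are made here that the answer does not state: that $R^2 = T^2/(\nu + T^2)$ with $\nu = n_1 + n_2 - 2$, and that the partial $T^2$ in (g) is converted to $F$ by the scaling used in the answer to Problem 5.7 with $p = 2$ and $q = 2$. The value of $\nu$ is inferred from the printed numbers, not quoted from the data.

```python
# Values printed in the answer to 5.16.
T2_full = 133.4873      # parts (a), (d), (e)
R2 = 0.782975           # part (e)
T2_partial = 12.5206    # part (g): T^2(y3, y4 | y1, y2)

# Assumed relation R^2 = T^2 / (nu + T^2); solve for the implied nu = n1 + n2 - 2.
nu_implied = T2_full * (1 - R2) / R2
nu = round(nu_implied)

# Assumed F conversion for a partial T^2 with p conditioning variables and q added ones.
p, q = 2, 2
F_partial = ((nu - p) - q + 1) / ((nu - p) * q) * T2_partial
print(nu_implied, nu, round(F_partial, 4))
```

Under these assumptions the implied error degrees of freedom come out as a whole number and the converted $F$ matches the printed value in part (g).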

5.17  By (5.34), the test for parallelism gives $T^2 = 132.6863$. The discriminant
function coefficient vector is given by (5.35) as $\mathbf{a}' = (-.362, -.223, -.137)$.

5.18  (a) $T^2 = 66.6604$

(b) $t_1 = -.6556$, $t_2 = 2.6139$, $t_3 = -3.2884$, $t_4 = -4.6315$, $t_5 = 1.8873$,
$t_6 = -3.2205$

(c) By (5.32),

$t^2(y_1|y_2, y_3, y_4, y_5, y_6) = .0758$, $t^2(y_2|y_1, y_3, y_4, y_5, y_6) = 6.4513$,
$t^2(y_3|y_1, y_2, y_4, y_5, y_6) = 6.9518$, $t^2(y_4|y_1, y_2, y_3, y_5, y_6) = 6.0309$,
$t^2(y_5|y_1, y_2, y_3, y_4, y_6) = 3.7052$, $t^2(y_6|y_1, y_2, y_3, y_4, y_5) = 6.2619$.

(d) By (5.29), $T^2(y_4, y_5, y_6|y_1, y_2, y_3) = 27.547$.
5.19  (a) $T^2 = 70.5679$  (b) $T^2(y_5, y_6|y_3, y_4) = 13.1517$

(c) $T^2(y_1, y_2|y_3, y_4, y_5, y_6) = 8.5162$

5.20  (a) $T^2 = 18.4625$  (b) $\mathbf{a}' = (-.057, -.010, -.242, -.071)$

(c) By (5.32),

$t^2(y_1|y_2, y_3, y_4) = 3.3315$, $t^2(y_2|y_1, y_3, y_4) = .0102$,
$t^2(y_3|y_1, y_2, y_4) = 1.4823$, $t^2(y_4|y_1, y_2, y_3) = .0013$.

5.21  (a) $T^2 = 15.1912$  (b) $\mathbf{a}' = (-.036, .048)$

(c) $t_1 = -3.8371$, $t_2 = -2.4362$

5.22  $T^2 = 22.3238$

5.23  (a) $T^2 = 206.1188$

(b) $t^2(d_1|d_2, d_3) = 59.0020$, $t^2(d_2|d_1, d_3) = 53.4507$, $t^2(d_3|d_1, d_2) = 80.9349$


CHAPTER  6 


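6.1  (a) Using $\bar{y}_{i.} = y_{i.}/n$, we have

$$\sum_{i=1}^k \sum_{j=1}^n (y_{ij} - \bar{y}_{i.})^2 = \sum_{ij} (y_{ij}^2 - 2y_{ij}\bar{y}_{i.} + \bar{y}_{i.}^2)$$

$$= \sum_{ij} y_{ij}^2 - 2\sum_i \bar{y}_{i.} \sum_j y_{ij} + n\sum_i \bar{y}_{i.}^2$$

$$= \sum_{ij} y_{ij}^2 - 2\sum_i \frac{y_{i.}^2}{n} + n\sum_i \left(\frac{y_{i.}}{n}\right)^2.$$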

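6.2  $$\Lambda = \frac{|\mathbf{E}|}{|\mathbf{E} + \mathbf{H}|} = \frac{|\mathbf{E}^{-1}||\mathbf{E}|}{|\mathbf{E}^{-1}||\mathbf{E} + \mathbf{H}|} = \frac{|\mathbf{E}^{-1}\mathbf{E}|}{|\mathbf{E}^{-1}(\mathbf{E} + \mathbf{H})|} = \frac{|\mathbf{I}|}{|\mathbf{I} + \mathbf{E}^{-1}\mathbf{H}|} = \frac{1}{\prod_{i=1}^s (1 + \lambda_i)};$$

see Section 2.11.2.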

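6.3  $(\mathbf{E}^{-1}\mathbf{H} - \lambda\mathbf{I})\mathbf{a} = \mathbf{0}$
$[(\mathbf{E}^{1/2}\mathbf{E}^{1/2})^{-1}\mathbf{H} - \lambda\mathbf{I}]\mathbf{a} = \mathbf{0}$
$[(\mathbf{E}^{1/2})^{-1}(\mathbf{E}^{1/2})^{-1}\mathbf{H} - \lambda\mathbf{I}]\mathbf{a} = \mathbf{0}$
$[(\mathbf{E}^{1/2})^{-1}\mathbf{H} - \lambda\mathbf{E}^{1/2}]\mathbf{a} = \mathbf{0}$
$[(\mathbf{E}^{1/2})^{-1}\mathbf{H} - \lambda\mathbf{E}^{1/2}](\mathbf{E}^{1/2})^{-1}\mathbf{E}^{1/2}\mathbf{a} = \mathbf{0}$
$[(\mathbf{E}^{1/2})^{-1}\mathbf{H}(\mathbf{E}^{1/2})^{-1} - \lambda\mathbf{I}]\mathbf{E}^{1/2}\mathbf{a} = \mathbf{0}$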

6.4  We need to show that $(2N + s + 1)/(2m + s + 1) = (\nu_E - p + s)/d$. Using
the definitions $N = \frac{1}{2}(\nu_E - p - 1)$, $m = \frac{1}{2}(|\nu_H - p| - 1)$, $d = \max(p, \nu_H)$,
and $s = \min(p, \nu_H)$, we have $2N + s + 1 = 2(\frac{1}{2})(\nu_E - p - 1) + s +
1 = \nu_E - p - 1 + s + 1 = \nu_E - p + s$. For the denominator, we have
$2m + s + 1 = 2(\frac{1}{2})(|\nu_H - p| - 1) + s + 1 = |\nu_H - p| + s$. Suppose $\nu_H > p$.
Then $|\nu_H - p| + s = \nu_H - p + p = \nu_H = d$. On the other hand, if $\nu_H < p$,
then $|\nu_H - p| + s = p - \nu_H + \nu_H = p = d$.
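
The identity in 6.4 can also be confirmed by brute force over a grid of small integer values. This is only a sketch with exact rational arithmetic; the function name and the grid limits are arbitrary choices for illustration.

```python
from fractions import Fraction


def both_sides(p, nu_h, nu_e):
    """Return the two sides of (2N + s + 1)/(2m + s + 1) = (nu_E - p + s)/d."""
    N = Fraction(nu_e - p - 1, 2)
    m = Fraction(abs(nu_h - p) - 1, 2)
    s = min(p, nu_h)
    d = max(p, nu_h)
    lhs = (2 * N + s + 1) / (2 * m + s + 1)
    rhs = Fraction(nu_e - p + s, d)
    return lhs, rhs


# Check every combination on a small grid, including the case nu_H = p.
checks = [
    both_sides(p, nu_h, nu_e)
    for p in range(1, 7)
    for nu_h in range(1, 7)
    for nu_e in range(p + 2, p + 12)
]
all_equal = all(lhs == rhs for lhs, rhs in checks)
print(all_equal)
```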

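6.5  If $p \le \nu_H$, we have $s = p$ and $|\nu_H - p| = \nu_H - p$. Then (6.30) becomes

$$\frac{2(sN + 1)U^{(s)}}{s^2(2m + s + 1)} = \frac{[2p\,\frac{1}{2}(\nu_E - p - 1) + 2]\,U^{(s)}}{p^2[2(\frac{1}{2})(\nu_H - p - 1) + p + 1]}$$

$$= \frac{[p(\nu_E - p - 1) + 2]\,U^{(s)}}{p^2(\nu_H - p - 1 + p + 1)}$$

$$= \frac{[p(\nu_E - p - 1) + 2]\,U^{(s)}}{p^2\nu_H},$$

which is the same as (6.31) because $p = s$.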

6.6  When $s = 1$, we have $V^{(1)} = \lambda_1/(1 + \lambda_1)$, $U^{(1)} = \lambda_1$, $\Lambda = 1/(1 + \lambda_1)$, and
$\theta = \lambda_1/(1 + \lambda_1)$. Solving the last of these for $\lambda_1$ gives $\lambda_1 = \theta/(1 - \theta)$, and
the results in (6.34), (6.35), and (6.36) follow immediately.

6.7  With $T^2 = (n_1 + n_2 - 2)U^{(1)}$ and $U^{(1)} = \theta/(1 - \theta)$, we obtain (5.19). We
obtain (5.18) from (5.19) by $V^{(1)} = \theta$. A similar argument leads to (5.16).

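6.8  (a) With $\bar{\mathbf{y}}_{i.} = \mathbf{y}_{i.}/n_i$ and $\bar{\mathbf{y}}_{..} = \mathbf{y}_{..}/N$, we obtain

$$\mathbf{H} = \sum_{i=1}^k n_i(\bar{\mathbf{y}}_{i.} - \bar{\mathbf{y}}_{..})(\bar{\mathbf{y}}_{i.} - \bar{\mathbf{y}}_{..})'$$

$$= \sum_i n_i(\bar{\mathbf{y}}_{i.}\bar{\mathbf{y}}_{i.}' - \bar{\mathbf{y}}_{i.}\bar{\mathbf{y}}_{..}' - \bar{\mathbf{y}}_{..}\bar{\mathbf{y}}_{i.}' + \bar{\mathbf{y}}_{..}\bar{\mathbf{y}}_{..}')$$

$$= \sum_i n_i\bar{\mathbf{y}}_{i.}\bar{\mathbf{y}}_{i.}' - \Big(\sum_i n_i\bar{\mathbf{y}}_{i.}\Big)\bar{\mathbf{y}}_{..}' - \bar{\mathbf{y}}_{..}\sum_i n_i\bar{\mathbf{y}}_{i.}' + N\bar{\mathbf{y}}_{..}\bar{\mathbf{y}}_{..}'$$

$$= \sum_i \frac{\mathbf{y}_{i.}\mathbf{y}_{i.}'}{n_i} - \frac{\mathbf{y}_{..}\mathbf{y}_{..}'}{N} - \frac{\mathbf{y}_{..}\mathbf{y}_{..}'}{N} + \frac{\mathbf{y}_{..}\mathbf{y}_{..}'}{N}.$$

6.9  $\bar{\mathbf{y}}_{1.} - \bar{\mathbf{y}}_{..}$ becomes

$$\bar{\mathbf{y}}_{1.} - \frac{n_1\bar{\mathbf{y}}_{1.} + n_2\bar{\mathbf{y}}_{2.}}{n_1 + n_2} = \frac{n_1\bar{\mathbf{y}}_{1.} + n_2\bar{\mathbf{y}}_{1.} - n_1\bar{\mathbf{y}}_{1.} - n_2\bar{\mathbf{y}}_{2.}}{n_1 + n_2} = \frac{n_2(\bar{\mathbf{y}}_{1.} - \bar{\mathbf{y}}_{2.})}{n_1 + n_2}.$$

The first term in the sum is

$$\frac{n_1 n_2^2}{(n_1 + n_2)^2}(\bar{\mathbf{y}}_{1.} - \bar{\mathbf{y}}_{2.})(\bar{\mathbf{y}}_{1.} - \bar{\mathbf{y}}_{2.})'.$$

The second term in the sum is

$$\frac{n_1^2 n_2}{(n_1 + n_2)^2}(\bar{\mathbf{y}}_{1.} - \bar{\mathbf{y}}_{2.})(\bar{\mathbf{y}}_{1.} - \bar{\mathbf{y}}_{2.})'.$$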


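6.10  $$\theta = \frac{\lambda_1}{1 + \lambda_1} = \frac{\text{SSH}(z)/\text{SSE}(z)}{1 + \text{SSH}(z)/\text{SSE}(z)} = \frac{\text{SSH}(z)}{\text{SSE}(z) + \text{SSH}(z)}$$

6.11  From $r_i^2 = \lambda_i/(1 + \lambda_i)$ obtain $\lambda_i = r_i^2/(1 - r_i^2)$. Substitute this into $1/(1 + \lambda_i)$
to obtain the result.

6.12  Substitute $A_P = V^{(s)}/s$ into (6.50) to obtain (6.26).

6.13  When $s = 1$, (6.51) becomes

$$A_{LH} = \frac{U^{(1)}}{1 + U^{(1)}}.$$

By (6.34), $U^{(1)} = \lambda_1$.

6.14  Substitute $A_{LH} = U^{(s)}/(s + U^{(s)})$ from (6.51) into (6.52) to obtain $F_2$ in
(6.31).

6.15  To show $\text{cov}(c_i\bar{\mathbf{y}}_{i.}) = c_i^2\boldsymbol{\Sigma}/n$, use (3.74), $\text{cov}(\mathbf{A}\mathbf{y}) = \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}'$, with $\mathbf{A} = c_i\mathbf{I}$.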

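6.16  By (6.9),

$$\mathbf{H}_z = n\sum_{i=1}^k (\bar{\mathbf{z}}_{i.} - \bar{\mathbf{z}}_{..})(\bar{\mathbf{z}}_{i.} - \bar{\mathbf{z}}_{..})'$$

$$= n\sum_i (\mathbf{C}\bar{\mathbf{y}}_{i.} - \mathbf{C}\bar{\mathbf{y}}_{..})(\mathbf{C}\bar{\mathbf{y}}_{i.} - \mathbf{C}\bar{\mathbf{y}}_{..})'$$

$$= n\sum_i [\mathbf{C}(\bar{\mathbf{y}}_{i.} - \bar{\mathbf{y}}_{..})][\mathbf{C}(\bar{\mathbf{y}}_{i.} - \bar{\mathbf{y}}_{..})]'$$

$$= n\mathbf{C}\Big[\sum_i (\bar{\mathbf{y}}_{i.} - \bar{\mathbf{y}}_{..})(\bar{\mathbf{y}}_{i.} - \bar{\mathbf{y}}_{..})'\Big]\mathbf{C}' \quad \text{[by (2.45)]}$$

6.17  $\mathbf{C}$ is not square.

6.18  $E(\mathbf{C}\bar{\mathbf{y}}_{..}) = \mathbf{C}E(\bar{\mathbf{y}}_{..}) = \mathbf{C}E\big(\sum_{i=1}^k \bar{\mathbf{y}}_{i.}/k\big)$

$= \mathbf{C}\sum_i E(\bar{\mathbf{y}}_{i.})/k = \mathbf{C}\sum_i \boldsymbol{\mu}_i/k$

$= \mathbf{0}$  [by $H_{03}$ in (6.83)]

$\text{cov}(\mathbf{C}\bar{\mathbf{y}}_{..}) = \mathbf{C}\boldsymbol{\Sigma}\mathbf{C}'/kn$ if there are no differences in the group means, $\mathbf{C}\boldsymbol{\mu}_1$,
$\mathbf{C}\boldsymbol{\mu}_2, \ldots, \mathbf{C}\boldsymbol{\mu}_k$. This condition is assured by $H_{01}$ in (6.78).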
6.19  For our purposes, it will suffice to show that $T^2$ has the characteristic form of
the $T^2$-distribution in (5.6).

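6.20  If $\boldsymbol{\Sigma} = \sigma^2\mathbf{I}$, (6.89) becomes

$$\varepsilon = \frac{[\text{tr}(\sigma^2\mathbf{I} - \mathbf{J}\sigma^2\mathbf{I}/p)]^2}{(p - 1)\,\text{tr}(\sigma^2\mathbf{I} - \mathbf{J}\sigma^2\mathbf{I}/p)^2} = \frac{\sigma^4[\text{tr}(\mathbf{I} - \mathbf{J}/p)]^2}{\sigma^4(p - 1)\,\text{tr}(\mathbf{I} - \mathbf{J}/p)^2}.$$

Show that $(\mathbf{I} - \mathbf{J}/p)^2 = \mathbf{I} - \mathbf{J}/p$. Then

$$\varepsilon = \frac{\sigma^4(p - p/p)^2}{\sigma^4(p - 1)(p - p/p)} = \frac{(p - 1)^2}{(p - 1)^2} = 1.$$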
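
The two facts used in 6.20 can be checked numerically for a particular dimension. The sketch below takes $p = 4$ and an arbitrary $\sigma^2$ (both illustrative choices), verifies that $\mathbf{I} - \mathbf{J}/p$ is idempotent, and evaluates the ratio of traces.

```python
p = 4
sigma2 = 2.5  # arbitrary positive variance, chosen only for illustration


def matmul(x, y):
    """Multiply two list-of-rows matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def trace(m):
    return sum(m[i][i] for i in range(len(m)))


# centering = I - J/p, where J is the p x p matrix of ones.
centering = [[(1.0 if i == j else 0.0) - 1.0 / p for j in range(p)] for i in range(p)]
centering_sq = matmul(centering, centering)
idempotent = all(
    abs(centering_sq[i][j] - centering[i][j]) < 1e-12
    for i in range(p)
    for j in range(p)
)

# scaled = sigma^2 (I - J/p), the matrix whose traces appear in the ratio.
scaled = [[sigma2 * v for v in row] for row in centering]
epsilon = trace(scaled) ** 2 / ((p - 1) * trace(matmul(scaled, scaled)))
print(idempotent, epsilon)
```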

6.21  The (univariate) expected mean square corresponding to $\bar{\mu}_.$ in a one-way
ANOVA is $\sigma^2 + N\bar{\mu}_.^2$. Thus the mean square for $\bar{\mu}_.$ is tested with MSE. The
corresponding multivariate test therefore uses $\mathbf{H}^*$ and $\mathbf{E}$.

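6.22  From (6.105) we have

$$\Lambda = \frac{|\mathbf{A}\mathbf{E}\mathbf{A}'|}{|\mathbf{A}(\mathbf{E} + \mathbf{H}^*)\mathbf{A}'|} = \frac{|\mathbf{A}\mathbf{E}\mathbf{A}'|}{|\mathbf{A}\mathbf{E}\mathbf{A}' + \mathbf{A}\mathbf{H}^*\mathbf{A}'|}.$$

Substitute $\mathbf{H}^* = kn\bar{\mathbf{y}}_{..}\bar{\mathbf{y}}_{..}'$ to obtain

$$\Lambda = \frac{|\mathbf{A}\mathbf{E}\mathbf{A}'|}{|\mathbf{A}\mathbf{E}\mathbf{A}' + \sqrt{kn}\mathbf{A}\bar{\mathbf{y}}_{..}(\sqrt{kn}\mathbf{A}\bar{\mathbf{y}}_{..})'|}.$$

Now use (2.95) with $\mathbf{B} = \mathbf{A}\mathbf{E}\mathbf{A}'$ and $\mathbf{c} = \sqrt{kn}\mathbf{A}\bar{\mathbf{y}}_{..}$ to obtain

$$\Lambda = \frac{1}{1 + kn(\mathbf{A}\bar{\mathbf{y}}_{..})'(\mathbf{A}\mathbf{E}\mathbf{A}')^{-1}(\mathbf{A}\bar{\mathbf{y}}_{..})}.$$

Multiply and divide by $\nu_E$ and use (6.101) to obtain (6.106).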

6.23  Solve  for  T2  in  (6.106). 

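6.24  In $\mathbf{C}_1\mathbf{B}'$ the rows of $\mathbf{C}_1$ are multiplied by the rows of $\mathbf{B}$. Show that $\mathbf{C}_1\mathbf{B}' = \mathbf{O}$.

6.25  As noted, the function $(\bar{\mathbf{y}} - \mathbf{A}\boldsymbol{\beta})'\mathbf{S}^{-1}(\bar{\mathbf{y}} - \mathbf{A}\boldsymbol{\beta})$ is similar to $\text{SSE} = (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})$
in (10.4) and (10.6). By an argument similar to that used in Section 10.2.2
to obtain $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$, it follows that $\hat{\boldsymbol{\beta}} = (\mathbf{A}'\mathbf{S}^{-1}\mathbf{A})^{-1}\mathbf{A}'\mathbf{S}^{-1}\bar{\mathbf{y}}$. An alternative
approach (for those familiar with differentiation with respect to a vector)
is to expand $(\bar{\mathbf{y}} - \mathbf{A}\boldsymbol{\beta})'\mathbf{S}^{-1}(\bar{\mathbf{y}} - \mathbf{A}\boldsymbol{\beta})$ to four terms, differentiate with respect
to $\boldsymbol{\beta}$, and set the result equal to $\mathbf{0}$.

6.26  Expand $n(\bar{\mathbf{y}} - \mathbf{A}\hat{\boldsymbol{\beta}})'\mathbf{S}^{-1}(\bar{\mathbf{y}} - \mathbf{A}\hat{\boldsymbol{\beta}})$ to four terms and substitute

$$\hat{\boldsymbol{\beta}} = (\mathbf{A}'\mathbf{S}^{-1}\mathbf{A})^{-1}\mathbf{A}'\mathbf{S}^{-1}\bar{\mathbf{y}}$$

into the last one.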
6.27  (a) $\mathbf{E} = \begin{pmatrix} 13.41 & 7.72 & 8.68 & 5.86 \\ 7.72 & 8.48 & 7.53 & 6.21 \\ 8.68 & 7.53 & 11.61 & 7.04 \\ 5.86 & 6.21 & 7.04 & 10.57 \end{pmatrix}$,

$\mathbf{H} = \begin{pmatrix} 1.05 & 2.17 & -1.38 & -.76 \\ 2.17 & 4.88 & -2.37 & -1.26 \\ -1.38 & -2.37 & 2.38 & 1.38 \\ -.76 & -1.26 & 1.38 & .81 \end{pmatrix}$

$\Lambda = .224$, $V^{(s)} = .860$, $U^{(s)} = 3.08$, and $\theta = .747$. All four are significant.

(b) $\eta^2_\Lambda = 1 - \Lambda = .776$, $\eta^2_\theta = \theta = .747$, $A_\Lambda = 1 - \Lambda^{1/s} = .526$, $A_{LH} = .606$,
$A_P = V^{(s)}/s = .430$

(c) The eigenvalues of $\mathbf{E}^{-1}\mathbf{H}$ are 2.9515 and .1273. The essential dimensionality
of the space of the mean vectors is 1.

(d) For 1, 2 vs. 3 we have $\Lambda = .270$, $V^{(s)} = .730$, $U^{(s)} = 2.702$, and $\theta = .730$.
All four are significant. For 1 vs. 2 we obtain $\Lambda = .726$, $V^{(s)} = .274$,
$U^{(s)} = .377$, and $\theta = .274$. All four are significant.

(e)  Variable     $y_1$     $y_2$     $y_3$     $y_4$
     $F$          1.29      9.50      3.39      1.27

The $F$'s for $y_2$ and $y_3$ are significant. For the discriminant function
$z = \mathbf{a}'\mathbf{y}$, where $\mathbf{a}$ is the first eigenvector of $\mathbf{E}^{-1}\mathbf{H}$, we have
$\mathbf{a}' = (-.032, -.820, .533, .208)$. Again $y_2$ and $y_3$ contribute most to separation
of groups.

(f) By (6.127), $\Lambda(y_3, y_4|y_1, y_2) = \Lambda(y_1, y_2, y_3, y_4)/\Lambda(y_1, y_2) = .224/.568 =
.395 < \Lambda_{.05} = .725$.

(g) By (6.128),

$\Lambda(y_1|y_2, y_3, y_4) = \Lambda(y_1, y_2, y_3, y_4)/\Lambda(y_2, y_3, y_4)
= .224/.240 = .934 > \Lambda_{.05} = .819$,
$\Lambda(y_2|y_1, y_3, y_4) = .224/.538 = .417 < .819$,
$\Lambda(y_3|y_1, y_2, y_4) = .224/.369 = .609 < .819$,
$\Lambda(y_4|y_1, y_2, y_3) = .224/.243 = .924 > .819$.
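
All four MANOVA statistics in 6.27(a) and the association measures in 6.27(b) are functions of the eigenvalues reported in 6.27(c). The following sketch recomputes them from those two eigenvalues, using the product and sum forms shown in the answers to Problems 6.2 and 6.6 and the expression $A_{LH} = U^{(s)}/(s + U^{(s)})$ quoted in the answer to Problem 6.14; variable names are illustrative.

```python
import math

# Eigenvalues of E^{-1}H as printed in the answer to 6.27(c).
eigenvalues = [2.9515, 0.1273]
s = len(eigenvalues)

wilks_lambda = 1.0 / math.prod(1.0 + lam for lam in eigenvalues)
pillai_v = sum(lam / (1.0 + lam) for lam in eigenvalues)
lawley_hotelling_u = sum(eigenvalues)
roy_theta = max(eigenvalues) / (1.0 + max(eigenvalues))

# Measures of association from part (b).
eta2_lambda = 1.0 - wilks_lambda
a_lambda = 1.0 - wilks_lambda ** (1.0 / s)
a_lh = lawley_hotelling_u / (s + lawley_hotelling_u)
a_p = pillai_v / s

print(round(wilks_lambda, 3), round(pillai_v, 3),
      round(lawley_hotelling_u, 2), round(roy_theta, 3))
print(round(eta2_lambda, 3), round(a_lambda, 3), round(a_lh, 3), round(a_p, 3))
```

The first eigenvalue dominates every statistic, which is consistent with the remark in part (c) that the mean vectors are essentially one-dimensional.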


6.28  (a) S effect: $\Lambda = .00065$, $V^{(s)} = 2.357$, $U^{(s)} = 142.304$, $\theta = .993$. All are
significant.

V effect: $\Lambda = .065$, $V^{(s)} = 1.107$, $U^{(s)} = 11.675$, $\theta = .920$. All are
significant.

SV interaction: $\Lambda = .138$, $V^{(s)} = 1.321$, $U^{(s)} = 3.450$, $\theta = .726$. All are
significant.

(b) Contrast on V comparing 2 vs. 1, 3: $\Lambda = .0804$, $V^{(s)} = .920$, $U^{(s)} =
11.445$, $\theta = .920$. All are significant.

(c) Linear contrast for S: $\Lambda = .0073$, $V^{(s)} = .993$, $U^{(s)} = 135.273$, $\theta =
.993$. All are significant.

Quadratic contrast for S: $\Lambda = .168$, $V^{(s)} = .832$, $U^{(s)} = 4.956$, $\theta = .832$.
All are significant.

Cubic contrast for S: $\Lambda = .325$, $V^{(s)} = .675$, $U^{(s)} = 2.076$, $\theta = .675$.
All are significant.

(d) The ANOVA $F$'s for each variable are as follows:

     Source     $y_1$      $y_2$      $y_3$      $y_4$
     S          980.21     214.24     876.13     73.91
     V          251.22       9.47      14.77     27.12
     SV          20.37       2.84       3.44      2.08

All $F$'s are significant except the last one, 2.08.

(e) Test of significance of $y_3$ and $y_4$ adjusted for $y_1$ and $y_2$:

                                      S         V         SV
     $\Lambda(y_3, y_4|y_1, y_2)$    .1226     .9336     .6402

Test of significance of each variable adjusted for the other variables:

                                      S         V         SV
     $\Lambda(y_1|y_2, y_3, y_4)$    .1158     .2099     .3082
     $\Lambda(y_2|y_1, y_3, y_4)$    .5586     .8134     .7967
     $\Lambda(y_3|y_1, y_2, y_4)$    .2271     .9627     .7604
     $\Lambda(y_4|y_1, y_2, y_3)$    .6692     .9795     .8683

6.29  V = velocity (fixed), L = lubricant (random).

V effect (using $\mathbf{H}_{VL}$ for error matrix): $\Lambda = .0492$, $V^{(s)} = .951$, $U^{(s)} =
19.315$, $\theta = .951$. With $p = 2$, $\nu_H = 1$, and $\nu_E = 3$, $\Lambda_{.05} = .050$, $V^{(s)}_{.05} =
.950$, $U^{(s)}_{.05} = T^2_{.05}/\nu_E = 19.00$, $\theta_{.05} = .950$. Thus all four test statistics are
significant.

L effect (using $\mathbf{E}$ for error matrix): $\Lambda = .692$, $V^{(s)} = .314$, $U^{(s)} = .438$,
$\theta = .295$. None is significant.

VL interaction (using $\mathbf{E}$ for error matrix): $\Lambda = .932$, $V^{(s)} = .069$, $U^{(s)} =
.073$, $\theta = .061$. None is significant.


6.30  Source                        $\Lambda$     $V^{(s)}$   $U^{(s)}$    $\theta$   Significant?
      (a) Reagent                   .0993         1.126       6.911        .868       Yes
      (b) Contrast 1 vs. 2, 3, 4    .146           .854       5.871        .854       Yes
      Subjects                      .00000082     2.847    1091.127        .999       Yes
6.31  P = proportion of filler, T = surface treatment, F = filler:

      Source   $\Lambda$   $V^{(s)}$   $U^{(s)}$   $\theta$   Significant?
      P        .138        .977         5.441      .841       Yes
      T        .080        .920        11.503      .920       Yes
      PT       .712        .295          .396      .271       No
      F        .019        .980        51.180      .981       Yes
      PF       .179        .958         3.835      .784       Yes
      TF       .355        .645         1.815      .645       Yes
      PTF      .752        .264          .309      .172       No

6.32  A = period; P, T, and F are defined in Problem 6.31:

      Source   $\Lambda$   $V^{(s)}$   $U^{(s)}$   $\theta$   Significant?
      A        .021        .979        47.099      .979       Yes
      AP       .475        .545         1.063      .505       No
      AT       .142        .858         6.049      .858       Yes
      APT      .777        .228          .282      .208       No
      AF       .095        .905         9.486      .905       Yes
      APF      .622        .387          .594      .363       No
      ATF      .387        .613         1.586      .613       Yes
      APTF     .781        .229          .267      .169       No

For the between-subject factors and interactions, we have

      Source   df    $F$       $p$-Value
      P         2     21.79    < .0001
      T         1     78.34    < .0001
      PT        2      1.28      .3143
      F         1    345.04    < .0001
      PF        2     15.79      .0004
      TF        1      5.36      .0392
      PTF       2       .48      .6294
      Error    12
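
The columns of the within-subjects table in 6.32 are not independent of one another. As a rough consistency check, and assuming $s = 2$ for the rows involving P (an assumption made here, since the answer does not state $s$), the two eigenvalues can be recovered from $\theta$ and $U^{(s)}$ and then used to recompute $\Lambda$. The function name is hypothetical.

```python
def wilks_from_theta_and_u(theta, u):
    """Recover Wilks' Lambda from Roy's theta and the Lawley-Hotelling U, assuming s = 2."""
    lam1 = theta / (1.0 - theta)  # largest eigenvalue, from theta = lam1 / (1 + lam1)
    lam2 = u - lam1               # second eigenvalue, from U = lam1 + lam2
    return 1.0 / ((1.0 + lam1) * (1.0 + lam2))


# (theta, U) pairs as printed for three rows of the table.
wilks_ap = wilks_from_theta_and_u(0.505, 1.063)
wilks_apf = wilks_from_theta_and_u(0.363, 0.594)
wilks_aptf = wilks_from_theta_and_u(0.169, 0.267)
print(round(wilks_ap, 3), round(wilks_apf, 3), round(wilks_aptf, 3))
```

Agreement with the printed $\Lambda$ column is only expected to three decimals because the inputs are themselves rounded.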

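6.33  For parallelism, we use (6.79) to obtain $\Lambda = .2397$. For levels, we use (6.81)
and (6.82) to obtain $\Lambda = .9651$ and $F = .597$. For flatness we use (6.84) to
obtain $T^2 = 110.521$.

6.34  (a) By (6.90), $T^2 = 20.7420$. By (6.105) or (6.106), $\Lambda = .5655$.

(b) For each row $\mathbf{c}_i'$ of $\mathbf{C}$, we use $T_i^2 = n(\mathbf{c}_i'\bar{\mathbf{y}})'(\mathbf{c}_i'\mathbf{S}\mathbf{c}_i)^{-1}\mathbf{c}_i'\bar{\mathbf{y}}$, as in Example 6.9.2:
$T_1^2 = 17.0648$, $T_2^2 = .3238$, $T_3^2 = .2714$. This can also be done
by Wilks' $\Lambda$ using $\Lambda_i = \mathbf{c}_i'\mathbf{E}\mathbf{c}_i/\mathbf{c}_i'(\mathbf{E} + \mathbf{H}^*)\mathbf{c}_i$: $\Lambda_1 = .6127$, $\Lambda_2 = .9882$,
$\Lambda_3 = .9900$.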

6.35  The six variables represent two within-subjects factors: $y_1$ is $A_1B_1$, $y_2$ is $A_1B_2$,
$y_3$ is $A_1B_3$, $x_1$ is $A_2B_1$, $x_2$ is $A_2B_2$, and $x_3$ is $A_2B_3$. Using linear and quadratic
effects (other orthogonal contrasts could be used), the matrices $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{G}$ in
(6.98), (6.99), and (6.100) become

$$\mathbf{A} = \begin{pmatrix} 1 & 1 & 1 & -1 & -1 & -1 \end{pmatrix}, \quad \mathbf{B} = \begin{pmatrix} 1 & 0 & -1 & 1 & 0 & -1 \\ 1 & -2 & 1 & 1 & -2 & 1 \end{pmatrix},$$

$$\mathbf{G} = \begin{pmatrix} 1 & 0 & -1 & -1 & 0 & 1 \\ 1 & -2 & 1 & -1 & 2 & -1 \end{pmatrix}.$$

Using these in $T^2$ as given by (6.101), (6.102), and (6.103), we obtain $T_A^2 =
193.0901$, $T_B^2 = 2.8000$, and $T_{AB}^2 = 6.8676$. Using MANOVA tests for the
same within-subjects factors, we obtain

      Source   $\Lambda$   $V^{(s)}$   $U^{(s)}$   $\theta$   Significant?
      A        .202        .798        3.941       .798       Yes
      B        .946        .054         .057       .054       No
      AB       .877        .123         .140       .123       Yes

6.36  MANOVA tests for the within-subjects effect T (time), and interactions of time
with the between-subjects effects C (cancer) and G (gender):

      Source   $\Lambda$   $V^{(s)}$   $U^{(s)}$   $\theta$
      T        .258        .742        2.874       .742
      TC       .363        .809        1.299       .444
      TG       .929        .071         .077       .071
      TCG      .809        .201         .225       .130

ANOVA $F$-tests for between-subjects factors and interactions:

      Source   df   $F$     $p$-Value
      C         5   4.16    .003
      G         1   2.69    .107
      CG        5    .37    .869

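6.37  (a) $T^2 = 79.551$

(b) Using $t_i = \mathbf{c}_i'\bar{\mathbf{y}}/\sqrt{\mathbf{c}_i'\mathbf{S}\mathbf{c}_i/n}$, where $\mathbf{c}_i'$ is the $i$th row of $\mathbf{C}$, we obtain $t_1 =
7.155$, $t_2 = -.445$, $t_3 = -.105$.

6.38  (a) $T^2 = 1712.2201$

(b) Using $t_i = \mathbf{c}_i'\bar{\mathbf{y}}/\sqrt{\mathbf{c}_i'\mathbf{S}\mathbf{c}_i/n}$, we obtain $t_1 = 332.358$, $t_2 = 54.589$, $t_3 =
.056$, $t_4 = 7.637$, $t_5 = 4.344$, $t_6 = 1.968$.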

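6.39  (a) Using $T^2 = N(\mathbf{C}\bar{\mathbf{y}}_{..})'(\mathbf{C}\mathbf{S}_{pl}\mathbf{C}')^{-1}(\mathbf{C}\bar{\mathbf{y}}_{..})$ in (6.122), we obtain $T^2 =
17.582 < T^2_{.05,3,9} = 27.202$.

(b) $t_1 = .951$, $t_2 = 1.606$, $t_3 = .127$ [Since the $T^2$-test in part (a) did not
reject $H_0$, these would ordinarily not be calculated.]

(c) Using $\Lambda = |\mathbf{C}\mathbf{E}\mathbf{C}'|/|\mathbf{C}(\mathbf{E} + \mathbf{H})\mathbf{C}'|$ in (6.124), we obtain $\Lambda = .3107 >
\Lambda_{.05,3,2,9} = .203$.

(d) To compare groups using each row of $\mathbf{C}$, we use $\Lambda_i = \mathbf{c}_i'\mathbf{E}\mathbf{c}_i/\mathbf{c}_i'(\mathbf{E} + \mathbf{H})\mathbf{c}_i$
to obtain $\Lambda_1 = .833$, $\Lambda_2 = .988$, $\Lambda_3 = .650$. [Since the $\Lambda$-test in part (c)
did not reject $H_0$, we would ordinarily not have calculated these.]

6.40  (a) Using $T^2 = N(\mathbf{C}\bar{\mathbf{y}}_{..})'(\mathbf{C}\mathbf{S}_{pl}\mathbf{C}')^{-1}(\mathbf{C}\bar{\mathbf{y}}_{..})$ in (6.122), we obtain $T^2 =
33.802 > T^2_{.05,4,24} = 12.983$.

(b) Using $t_i^2 = N(\mathbf{c}_i'\bar{\mathbf{y}}_{..})^2/\mathbf{c}_i'\mathbf{S}_{pl}\mathbf{c}_i$, we obtain $t_1^2 = .675$, $t_2^2 = .393$, $t_3^2 =
32.626$. Only the cubic effect is significant.

(c) For an overall test comparing groups, we use (6.124),

$\Lambda = |\mathbf{C}\mathbf{E}\mathbf{C}'|/|\mathbf{C}(\mathbf{E} + \mathbf{H})\mathbf{C}'| = .4361$.

(d) To compare groups using each row of $\mathbf{C}$, we use

$\Lambda_i = \mathbf{c}_i'\mathbf{E}\mathbf{c}_i/\mathbf{c}_i'(\mathbf{E} + \mathbf{H})\mathbf{c}_i$: $\Lambda_1 = .534$, $\Lambda_2 = .764$, $\Lambda_3 = .941$.

6.41  (a) Using $T^2 = N(\mathbf{C}\bar{\mathbf{y}}_{..})'(\mathbf{C}\mathbf{S}_{pl}\mathbf{C}')^{-1}(\mathbf{C}\bar{\mathbf{y}}_{..})$ in (6.122), we obtain $T^2 =
45.500$.

(b) Using $t_i^2 = N(\mathbf{c}_i'\bar{\mathbf{y}}_{..})^2/\mathbf{c}_i'\mathbf{S}_{pl}\mathbf{c}_i$, we obtain $t_1^2 = 18.410$, $t_2^2 = 8.385$, $t_3^2 =
3.446$, $t_4^2 = .011$, $t_5^2 = .098$, $t_6^2 = 2.900$.

(c) For an overall test comparing groups, we use (6.124),

$\Lambda = |\mathbf{C}\mathbf{E}\mathbf{C}'|/|\mathbf{C}(\mathbf{E} + \mathbf{H})\mathbf{C}'| = .304$.

(d) To compare groups using each row of $\mathbf{C}$, we use

$\Lambda_i = \mathbf{c}_i'\mathbf{E}\mathbf{c}_i/\mathbf{c}_i'(\mathbf{E} + \mathbf{H})\mathbf{c}_i$: $\Lambda_1 = .695$, $\Lambda_2 = .925$, $\Lambda_3 = .731$,
$\Lambda_4 = .814$, $\Lambda_5 = .950$, $\Lambda_6 = .894$.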

6.42  (a) Combined groups (pooled covariance matrix). Using $t$ = number of minutes
$- 30$, we obtain, by (6.115),

$\hat{\boldsymbol{\beta}}' = (98.1, .981, .0418, -.00101, -.000048)$.

By (6.116), we obtain $T^2 = .216$. By (6.118), we have

$\hat{\boldsymbol{\mu}}' = (95.5, 96.7, 95.6, 93.8, 98.1, 99.2)$.

(b) Group 1: $\hat{\boldsymbol{\beta}}_1' = (100.7, .819, .040, -.00085, -.000038)$, $T^2 = .0113$,
$\hat{\boldsymbol{\mu}}_1' = (105.2, 104.4, 101.5, 98.6, 100.6, 108.1)$

(c) Groups 2-4: $\hat{\boldsymbol{\beta}}_2' = (97.4, 1.010, .0403, -.00103, -.000049)$, $T^2 =
.2554$, $\hat{\boldsymbol{\mu}}_2' = (92.6, 94.4, 93.8, 92.4, 97.4, 96.6)$
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6.43  (a)  For  the  control  group,  the  overall  test  is 

T2  =  «i(Cy1_)/(CSiC/)_1(Cy1_)  =  554.749. 

For  each  row  of  C  (linear,  quadratic,  etc.),  we  have 

t2  =  njlcJyj  )2/C;SiC,- :  t\  —  5.714, 
t\  =  50.111,  tj  =  50.767,  t2  =  8.011,  =  .508. 

(b)  For  the  obese  group,  we  obtain  7' 2  =  «2(Cy2  /(CStC')  1  (Cy2  )  = 
128.552.  For  the  five  rows  of  C,  we  obtain  t2  =  4.978,  t2  —  107.129, 
t2  =  5.225,  t2  =  10.750,  f2  =  3.572. 

(c)  For  the  combined  groups  (Spi  =  pooled  covariance  matrix),  we  use 
T 2  =  N(Cy J'(CSpiC')_1(Cy  )  in  (6.122)  to  obtain  T2  =  247.0079. 
We  test  for  linear,  quadratic,  etc.,  trends  using  the  rows  of  C  in  tf  — 
iV(c;yJ2/c'SpiC,-:  t2  =  1.162,  t2  =  155.017,  t2  =  30.540,  t2  =  1.319, 
t2  —  .506.  To  compare  groups,  we  use  A  =  ICEC'l/ICfE  +  H)C'| 
in  (6.124)  and  A;  =  c{Ec,-/cJ.(E  +  H)cf:  A  =  .4902,  Aj  =  .7947, 
A2  =  .9940,  A3  =  .7987,  A4  =  .6228,  A5  =  .9172. 

6.44  Control group: By (6.115),

$\hat{\boldsymbol{\beta}}_1' = (3.129, .656, -.283, -.334, .192, .037, -.020)$.

By (6.116), $T^2 = .7633$. By (6.118),

$\hat{\boldsymbol{\mu}}_1' = (\hat{\mu}_{11}, \hat{\mu}_{12}, \ldots, \hat{\mu}_{18}) = (4.11, 3.29, 2.71, 2.71, 3.04, 3.39, 3.54, 3.95)$.

Obese group: $\hat{\boldsymbol{\beta}}_2' = (3.207, -.187, .463, .056, -.102, -.010, .010)$, $T^2 =
.3943$, $\hat{\boldsymbol{\mu}}_2' = (4.51, 4.12, 3.81, 3.48, 3.24, 3.37, 3.70, 4.02)$

Combined groups (pooled covariance matrix): $\hat{\boldsymbol{\beta}}' = (3.15, .162, .183, -.115,
.012, .010, -.002)$, $T^2 = .0158$, $\hat{\boldsymbol{\mu}}' = (4.36, 3.80, 3.36, 3.15, 3.13, 3.37, 3.63,
3.98)$

6.45  A = activator, T = time, C = group. Using (6.101), (6.102), and (6.103), we
obtain $T_A^2 = 5072.579$, $T_T^2 = 268.185$, $T_{AT}^2 = 143.491$. The same within-sample
factors and interaction can be tested with Wilks' $\Lambda$ using (6.105) and the other
three MANOVA tests:

      Source   $\Lambda$   $V^{(s)}$   $U^{(s)}$   $\theta$   Significant?
      A        .003        .997        317.04      .997       Yes
      T        .056        .944         16.76      .944       Yes
      AT       .100        .900          8.97      .900       Yes

The interactions of the within factors with the between factor C are tested
with Wilks' $\Lambda$ (Section 6.9.5) and with the other three MANOVA tests:

      Source   $\Lambda$   $V^{(s)}$   $U^{(s)}$   $\theta$   Significant?
      AC       .884        .116        .131        .116       No
      TC       .889        .111        .125        .111       No
      ATC      .795        .205        .258        .205       No

The between-subjects factor C is tested with an ANOVA $F$-test: $F = .47$,
$p$-value $= .504$.
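
When $s = 1$, the four MANOVA statistics are one-to-one functions of $T^2$, as in the answers to Problems 6.6 and 6.22. The sketch below reproduces the first table of 6.45 from the three $T^2$ values, assuming $\nu_E = 16$; that value is inferred here from the ratio $T^2/U^{(s)}$ in the printed table and is not stated in the answer.

```python
NU_E = 16  # assumption: error degrees of freedom implied by T^2 / U(s) in the table

# T^2 values printed in the answer to 6.45.
t2_values = {"A": 5072.579, "T": 268.185, "AT": 143.491}


def manova_from_t2(t2, nu_e):
    """For s = 1: U = T^2/nu_E, Lambda = 1/(1 + U), theta = V = U/(1 + U)."""
    u = t2 / nu_e
    wilks = 1.0 / (1.0 + u)
    theta = u / (1.0 + u)
    return wilks, theta, u


table = {name: manova_from_t2(t2, NU_E) for name, t2 in t2_values.items()}
for name, (wilks, theta, u) in table.items():
    print(name, round(wilks, 3), round(theta, 3), round(u, 2))
```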

CHAPTER  7 

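7.1  If $\boldsymbol{\Sigma}_0$ is substituted for $\mathbf{S}$ in (7.1), we have

$u = \nu[\ln|\boldsymbol{\Sigma}_0| - \ln|\boldsymbol{\Sigma}_0| + \text{tr}(\mathbf{I}) - p] = \nu[0 + p - p] = 0$.

7.2  $\ln|\boldsymbol{\Sigma}_0| - \ln|\mathbf{S}| = -\ln|\boldsymbol{\Sigma}_0|^{-1} - \ln|\mathbf{S}|$

$= -\ln|\boldsymbol{\Sigma}_0^{-1}| - \ln|\mathbf{S}|$  [by (2.91)]

$= -(\ln|\mathbf{S}| + \ln|\boldsymbol{\Sigma}_0^{-1}|)$

$= -\ln|\mathbf{S}\boldsymbol{\Sigma}_0^{-1}|$  [by (2.89)]

7.3  $-\ln\big(\prod_{i=1}^p \lambda_i\big) + \sum_{i=1}^p \lambda_i = -\sum_{i=1}^p \ln\lambda_i + \sum_{i=1}^p \lambda_i = \sum_{i=1}^p (\lambda_i - \ln\lambda_i)$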
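
Problems 7.1 to 7.3 connect two ways of writing the same statistic. The sketch below assumes (7.1) has the form $u = \nu[\ln|\boldsymbol{\Sigma}_0| - \ln|\mathbf{S}| + \text{tr}(\mathbf{S}\boldsymbol{\Sigma}_0^{-1}) - p]$, consistent with the substitution shown in the answer to 7.1, and that the eigenvalue form is $u = \nu[\sum_i(\lambda_i - \ln\lambda_i) - p]$ with $\lambda_i$ the eigenvalues of $\mathbf{S}\boldsymbol{\Sigma}_0^{-1}$. The $2 \times 2$ matrices and $\nu$ are hypothetical values chosen for illustration.

```python
import math

# Hypothetical 2 x 2 inputs (illustrative only).
S = [[4.0, 1.2], [1.2, 2.5]]
SIGMA0 = [[3.0, 0.5], [0.5, 2.0]]
NU = 19
P = 2


def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def inv2(m):
    d = det2(m)
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]


def matmul2(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def u_direct(s, sigma0, nu):
    """Assumed form of (7.1): nu [ln|Sigma0| - ln|S| + tr(S Sigma0^{-1}) - p]."""
    m = matmul2(s, inv2(sigma0))
    return nu * (math.log(det2(sigma0)) - math.log(det2(s)) + m[0][0] + m[1][1] - P)


def u_from_eigenvalues(s, sigma0, nu):
    """Eigenvalue form: nu [sum(lambda_i - ln lambda_i) - p]."""
    m = matmul2(s, inv2(sigma0))
    tr, dt = m[0][0] + m[1][1], det2(m)
    root = math.sqrt(max(tr * tr - 4.0 * dt, 0.0))  # guard against tiny negative round-off
    eigenvalues = [(tr + root) / 2.0, (tr - root) / 2.0]
    return nu * (sum(lam - math.log(lam) for lam in eigenvalues) - P)


print(u_direct(S, SIGMA0, NU), u_from_eigenvalues(S, SIGMA0, NU))
print(u_direct(SIGMA0, SIGMA0, NU))
```

The last line illustrates 7.1: when $\mathbf{S}$ equals $\boldsymbol{\Sigma}_0$ the statistic is zero up to floating-point error.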

7.4  As noted in Section 7.1, the likelihood ratio in this case involves the ratio of
the determinants of the sample covariance matrices under $H_0$ and $H_1$. Under
$H_1$, which is essentially unrestricted, the maximum likelihood estimate of $\boldsymbol{\Sigma}$
(corrected for bias) is given by (4.12) as $\mathbf{S}$. Under $H_0$ it is assumed that each
of the $p$ $y_i$'s in $\mathbf{y}$ has variance $\sigma^2$ and that all $y_i$'s are independent. Thus we
estimate $\sigma^2$ (unbiasedly) in each of the $p$ columns of the $\mathbf{Y}$ matrix [see (3.17)
and (3.23)] and pool the $p$ estimates to obtain

$$\hat{\sigma}^2 = \frac{\sum_{j=1}^p \sum_{i=1}^n (y_{ij} - \bar{y}_j)^2}{(n - 1)p}.$$

Show that by (3.22) and (3.23) this is equal to

$$\hat{\sigma}^2 = \frac{\sum_j s_{jj}}{p} = \frac{\text{tr}(\mathbf{S})}{p}.$$

Thus the likelihood ratio is

$$\text{LR} = \left(\frac{|\mathbf{S}|}{|\hat{\sigma}^2\mathbf{I}|}\right)^{n/2}.$$

Show that by (2.85) this becomes

$$\text{LR} = \left(\frac{|\mathbf{S}|}{|\mathbf{I}\,\text{tr}(\mathbf{S})/p|}\right)^{n/2} = \left(\frac{|\mathbf{S}|}{(\text{tr}\,\mathbf{S}/p)^p}\right)^{n/2}.$$


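7.5  If $\lambda_1 = \lambda_2 = \cdots = \lambda_p = \lambda$, say, then by (7.5),

$$u = \frac{p^p \prod_{i=1}^p \lambda_i}{(\sum_{i=1}^p \lambda_i)^p} = \frac{p^p\lambda^p}{(p\lambda)^p} = 1.$$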


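7.6  $$(1 - \rho)\mathbf{I} + \rho\mathbf{J} = \begin{pmatrix} 1 - \rho & 0 & \cdots & 0 \\ 0 & 1 - \rho & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 1 - \rho \end{pmatrix} + \begin{pmatrix} \rho & \rho & \cdots & \rho \\ \rho & \rho & \cdots & \rho \\ \vdots & \vdots & & \vdots \\ \rho & \rho & \cdots & \rho \end{pmatrix} = \begin{pmatrix} 1 & \rho & \cdots & \rho \\ \rho & 1 & \cdots & \rho \\ \vdots & \vdots & & \vdots \\ \rho & \rho & \cdots & 1 \end{pmatrix}$$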

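7.7  (a) Substitute $\mathbf{J} = \mathbf{j}\mathbf{j}'$ and $\mathbf{x} = \mathbf{j}$ into $\mathbf{J}\mathbf{x} = \lambda\mathbf{x}$ to obtain $\mathbf{j}\mathbf{j}'\mathbf{j} = \lambda\mathbf{j}$, which gives
$p\mathbf{j} = \lambda\mathbf{j}$.

(b) $\mathbf{S}_0 = s^2[(1 - r)\mathbf{I} + r\mathbf{J}] = s^2(1 - r)\left(\mathbf{I} + \frac{r}{1 - r}\mathbf{J}\right)$

(c) By (2.85) and (2.108), we have

$$|\mathbf{S}_0| = \left|s^2(1 - r)\left(\mathbf{I} + \frac{r}{1 - r}\mathbf{J}\right)\right| = (s^2)^p(1 - r)^p\left|\mathbf{I} + \frac{r}{1 - r}\mathbf{J}\right|$$

$$= (s^2)^p(1 - r)^p\prod_{i=1}^p (1 + \lambda_i) = (s^2)^p(1 - r)^p\left(1 + \frac{rp}{1 - r}\right)$$

$$= (s^2)^p(1 - r)^{p-1}(1 - r + rp) = (s^2)^p(1 - r)^{p-1}[1 + (p - 1)r].$$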
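
The closed form in 7.7(c) can be compared with a direct numerical determinant. The sketch uses hypothetical values $p = 4$, $s^2 = 2.0$, and $r = .3$ and a small Gaussian-elimination routine written for this illustration.

```python
def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-15:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap changes the sign of the determinant
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


# Hypothetical values for illustration.
p, s2, r = 4, 2.0, 0.3

# S0 = s^2 [(1 - r) I + r J]: s^2 on the diagonal, s^2 r elsewhere.
S0 = [[s2 if i == j else s2 * r for j in range(p)] for i in range(p)]

numeric = determinant(S0)
closed_form = s2 ** p * (1 - r) ** (p - 1) * (1 + (p - 1) * r)
print(numeric, closed_form)
```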
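7.8  $$M = \frac{|\mathbf{S}_1|^{\nu_1/2}|\mathbf{S}_2|^{\nu_2/2}\cdots|\mathbf{S}_k|^{\nu_k/2}}{|\mathbf{S}_{pl}|^{\sum_i \nu_i/2}} = \frac{|\mathbf{S}_1|^{\nu_1/2}|\mathbf{S}_2|^{\nu_2/2}\cdots|\mathbf{S}_k|^{\nu_k/2}}{|\mathbf{S}_{pl}|^{\nu_1/2}|\mathbf{S}_{pl}|^{\nu_2/2}\cdots|\mathbf{S}_{pl}|^{\nu_k/2}}$$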

7.9  (a)  M  =  .7015  (b)  M  =  .0797 


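7.10  $$\Lambda = \frac{|\mathbf{S}|}{|\mathbf{S}_{yy}||\mathbf{S}_{xx}|} = \frac{|\mathbf{S}_{xx}||\mathbf{S}_{yy} - \mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}|}{|\mathbf{S}_{yy}||\mathbf{S}_{xx}|}$$

$= |\mathbf{S}_{yy}^{-1}||\mathbf{S}_{yy} - \mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}|$  [by (2.91)]

$= |\mathbf{S}_{yy}^{-1}(\mathbf{S}_{yy} - \mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy})|$  [by (2.89)]

$= |\mathbf{I} - \mathbf{S}_{yy}^{-1}\mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}|$

$= \prod_{i=1}^s (1 - r_i^2)$  [by (2.108)],

where the $r_i^2$'s are the nonzero eigenvalues of $\mathbf{S}_{yy}^{-1}\mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}$. It was shown in
Section 2.11.2 that $1 - \lambda_i$ is an eigenvalue of $\mathbf{I} - \mathbf{A}$, where $\lambda_i$ is an eigenvalue
of $\mathbf{A}$.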


7.11  When all $p_i = 1$, we have $k = p$, and the submatrices in the denominators of
(7.33) and (7.34) reduce to $\mathbf{S}_{jj} = s_{jj}$, $j = 1, 2, \ldots, p$, and $\mathbf{R}_{jj} = 1$, $j = 1,
2, \ldots, p$.

7.12  When all $p_i = 1$, we have $k = p$ and

$$a_2 = p^2 - \sum_{i=1}^p p_i^2 = p^2 - p, \qquad a_3 = p^3 - p,$$

$$c = 1 - \frac{1}{12f\nu}(2a_3 + 3a_2)$$

$$= 1 - \frac{1}{6(p^2 - p)\nu}[2(p^3 - p) + 3(p^2 - p)]$$

$$= 1 - \frac{1}{6(p - 1)\nu}[2(p^2 - 1) + 3(p - 1)]$$

$$= 1 - \frac{1}{6(p - 1)\nu}[2(p - 1)(p + 1) + 3(p - 1)]$$

$$= 1 - \frac{1}{6\nu}(2p + 5).$$
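
The simplification in 7.12 can be verified with exact arithmetic. The sketch takes $f = a_2/2$, which is what the step from $12f\nu$ to $6(p^2 - p)\nu$ in the derivation implies, and compares the unsimplified and simplified expressions for $c$ over a small grid; the function names are illustrative.

```python
from fractions import Fraction


def c_unsimplified(p, nu):
    """c = 1 - (2 a3 + 3 a2) / (12 f nu) with all p_i = 1, so a2 = p^2 - p and a3 = p^3 - p."""
    a2 = p ** 2 - p
    a3 = p ** 3 - p
    f = Fraction(a2, 2)  # implied by 12 f nu = 6 (p^2 - p) nu in the derivation
    return 1 - Fraction(2 * a3 + 3 * a2) / (12 * f * nu)


def c_simplified(p, nu):
    """Final line of the derivation: c = 1 - (2p + 5) / (6 nu)."""
    return 1 - Fraction(2 * p + 5, 6 * nu)


agree = all(
    c_unsimplified(p, nu) == c_simplified(p, nu)
    for p in range(2, 10)
    for nu in range(5, 40)
)
print(agree, c_simplified(5, 46))
```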

7.13  As noted below (7.6), the degrees of freedom for the $\chi^2$-approximation is the
total number of parameters minus the number estimated under $H_0$. The number
of distinct parameters in $\boldsymbol{\Sigma}$ is $p + \binom{p}{2} = \frac{1}{2}p(p + 1)$. The number of parameters
estimated under $H_0$ is $p$. The difference is $\frac{1}{2}p(p + 1) - p = \frac{1}{2}p(p - 1)$.

7.14  By (7.1) and (7.2), $u = 11.094$ and $u' = 10.668$.

7.15  By (7.7), $u = .0000594$. By (7.9), $u' = 23.519$. For $H_0$: $\mathbf{C}\boldsymbol{\Sigma}\mathbf{C}' = \sigma^2\mathbf{I}$, $u =
.471$ and $u' = 2.050$.

7.16  For $H_0$: $\boldsymbol{\Sigma} = \sigma^2\mathbf{I}$, $u = .00513$ and $u' = 131.922$. For $H_0$: $\mathbf{C}\boldsymbol{\Sigma}\mathbf{C}' = \sigma^2\mathbf{I}$,
$u = .129$ and $u' = 36.278$.




7.17  For $H_0$: $\boldsymbol{\Sigma} = \sigma^2\mathbf{I}$, $u = .00471$ and $u' = 136.190$. For $H_0$: $\mathbf{C}\boldsymbol{\Sigma}\mathbf{C}' = \sigma^2\mathbf{I}$,
$u = .747$ and $u' = 7.486$.

7.18  By (7.16), $u' = 6.3323$ with 13 degrees of freedom. The $F$-approximation is
$F = .4802$ with 13 and 1147 degrees of freedom.

7.19  $u' = 21.488$, $F = 2.511$ with 8 and 217 degrees of freedom

7.20  $u' = 35.795$, $F = 4.466$ with 8 and 4905 degrees of freedom

7.21  $u = 8.7457$, $F = .8730$ with 10 and 6502 degrees of freedom

7.22  $|\mathbf{S}_1| = 2.620 \times 10^{14}$, $|\mathbf{S}_2| = 2.410 \times 10^{14}$, $|\mathbf{S}_{pl}| = 4.368 \times 10^{14}$, $u = 17.502$,
$F = .829$

7.23  $\ln M = -85.965$, $u = 156.434$, $a_1 = 21$, $a_2 = 17{,}797$, $F = 7.4396$

7.24  $\ln M = -7.082$, $u = 10.565$, $a_1 = 10$, $a_2 = 1340$, $F = 1.046$

7.25  $\ln M = -8.6062$, $u = 14.222$, $a_1 = 20$, $a_2 = 3909$, $F = .707$

7.26  $\ln M = -28.917$, $u = 44.018$, $a_1 = 50$, $a_2 = 3238$, $F = .8625$

7.27  $\ln M = -142.435$, $u = 174.285$, $a_1 = 110$, $a_2 = 2084$, $F = 1.448$

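7.28  $|\mathbf{S}| = 1{,}207{,}109.5$, $|\mathbf{S}_{yy}| = 2385.1$, $|\mathbf{S}_{xx}| = 1341.9$, $\Lambda = .3772$

7.29  $|\mathbf{S}| = 4.237 \times 10^{13}$, $|\mathbf{S}_{yy}| = 484{,}926.6$, $|\mathbf{S}_{xx}| = 131{,}406{,}938$, $\Lambda = .6650$

7.30  $|\mathbf{S}| = 9.676 \times 10^{-8}$, $|\mathbf{S}_{yy}| = .02097$, $|\mathbf{S}_{xx}| = 9.94 \times 10^{-6}$, $\Lambda = .4642$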
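
Each $\Lambda$ in 7.28 to 7.30 follows from the three determinants listed with it through $\Lambda = |\mathbf{S}|/(|\mathbf{S}_{yy}||\mathbf{S}_{xx}|)$, the first expression in the answer to Problem 7.10. A short check, with agreement expected only up to the rounding of the printed determinants:

```python
# (|S|, |S_yy|, |S_xx|, printed Lambda) as given in the answers to 7.28-7.30.
cases = {
    "7.28": (1207109.5, 2385.1, 1341.9, 0.3772),
    "7.29": (4.237e13, 484926.6, 131406938.0, 0.6650),
    "7.30": (9.676e-8, 0.02097, 9.94e-6, 0.4642),
}

recomputed = {
    problem: det_s / (det_yy * det_xx)
    for problem, (det_s, det_yy, det_xx, _printed) in cases.items()
}

for problem, value in recomputed.items():
    printed = cases[problem][3]
    print(problem, round(value, 4), printed)
```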

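7.31  $|\mathbf{S}| = 1.7148 \times 10^{16}$, $|\mathbf{S}_{11}| = 11{,}284.967$, $|\mathbf{S}_{22}| = 11{,}891.15$, $|\mathbf{S}_{33}| =
25{,}951.605$, $s_{44} = 22{,}227.158$, $s_{55} = 214.06$, $u = .00103$, $u' = 274.787$,
$\nu = 46$

7.32  $|\mathbf{S}| = 459.96$, $s_{11} = 140.54$, $s_{22} = 72.25$, $s_{33} = .250$, $u = .1811$, $u' =
12.246$, $f = 3$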

7.33  u  =  .0001379,  u'  =  16.297 

7.34  u  =  .0005176,  u'  =  127.367 

7.35  u  =  .005071,  u'  =  131.226 
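
The values of Λ in Problems 7.28 to 7.30 can be checked directly from the three determinants listed in each answer. The sketch below assumes the statistic is the determinant ratio Λ = |S|/(|Syy||Sxx|), which is consistent with the reported figures; the function and variable names are illustrative only.

```python
def wilks_lambda_independence(det_s, det_syy, det_sxx):
    """Determinant ratio |S| / (|Syy| |Sxx|) for testing independence of y and x."""
    return det_s / (det_syy * det_sxx)

# Determinants as reported in the answers to Problems 7.28-7.30.
reported = {
    "7.28": (1207109.5, 2385.1, 1341.9),
    "7.29": (4.237e13, 484926.6, 131406938.0),
    "7.30": (9.676e-8, 0.02097, 9.94e-6),
}

lambdas = {name: wilks_lambda_independence(*dets) for name, dets in reported.items()}

for name, lam in lambdas.items():
    print(name, round(lam, 4))
```

Because the determinants are themselves rounded, the recomputed ratios agree with the printed Λ values only to about three or four decimal places.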


CHAPTER  8 

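8.1 Using $\mathbf{a} = \mathbf{S}_{\mathrm{pl}}^{-1}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)$, we obtain

$$\frac{[\mathbf{a}'(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)]^2}{\mathbf{a}'\mathbf{S}_{\mathrm{pl}}\mathbf{a}} = \frac{[(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)'\mathbf{S}_{\mathrm{pl}}^{-1}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)]^2}{(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)'\mathbf{S}_{\mathrm{pl}}^{-1}\mathbf{S}_{\mathrm{pl}}\mathbf{S}_{\mathrm{pl}}^{-1}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)} = \frac{[(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)'\mathbf{S}_{\mathrm{pl}}^{-1}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)]^2}{(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)'\mathbf{S}_{\mathrm{pl}}^{-1}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)}$$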

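8.2 You may wish to use the following steps:

(i) In Section 5.6.2 the grouping variable $w$ is defined as $n_2/(n_1 + n_2)$ for
each observation in group 1 and $-n_1/(n_1 + n_2)$ for group 2. Show that
with this formulation, $\bar{w} = 0$.

(ii) Because $\bar{w} = 0$, there is no intercept, and the fitted model becomes

$$\hat{w}_i = b_1(y_{i1} - \bar{y}_1) + b_2(y_{i2} - \bar{y}_2) + \cdots + b_p(y_{ip} - \bar{y}_p), \qquad i = 1, 2, \ldots, n_1 + n_2.$$

Denote the resulting matrix of $y$ values corrected for their means as
$\mathbf{Y}_c$ and the vector of $w$'s as $\mathbf{w}$. Then the least squares estimate $\mathbf{b} = (b_1, b_2, \ldots, b_p)'$ is obtained as

$$\mathbf{b} = (\mathbf{Y}_c'\mathbf{Y}_c)^{-1}\mathbf{Y}_c'\mathbf{w}.$$

Using (2.51), show that

$$\mathbf{Y}_c'\mathbf{Y}_c = \sum_{i=1}^{2}\sum_{j=1}^{n_i}(\mathbf{y}_{ij} - \bar{\mathbf{y}})(\mathbf{y}_{ij} - \bar{\mathbf{y}})' = \sum_{i=1}^{2}\sum_{j=1}^{n_i}(\mathbf{y}_{ij} - \bar{\mathbf{y}}_i)(\mathbf{y}_{ij} - \bar{\mathbf{y}}_i)' + \frac{n_1 n_2}{n_1 + n_2}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)',$$

where $\bar{\mathbf{y}} = (n_1\bar{\mathbf{y}}_1 + n_2\bar{\mathbf{y}}_2)/(n_1 + n_2)$. It will be helpful to write the first
sum above as

$$\sum_{j=1}^{n_1}(\mathbf{y}_{1j} - \bar{\mathbf{y}})(\mathbf{y}_{1j} - \bar{\mathbf{y}})' + \sum_{j=1}^{n_2}(\mathbf{y}_{2j} - \bar{\mathbf{y}})(\mathbf{y}_{2j} - \bar{\mathbf{y}})'$$

and add and subtract $\bar{\mathbf{y}}_1$ in the first term and $\bar{\mathbf{y}}_2$ in the second.

(iii) Show that

$$\mathbf{Y}_c'\mathbf{w} = \sum_{i=1}^{2}\sum_{j=1}^{n_i}(\mathbf{y}_{ij} - \bar{\mathbf{y}})w_{ij} = \frac{n_1 n_2}{n_1 + n_2}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2).$$

Again it will be helpful to sum separately over the two groups.

(iv) From steps (ii) and (iii) we have

$$\mathbf{b} = (\nu\mathbf{S}_{\mathrm{pl}} + k\mathbf{d}\mathbf{d}')^{-1}k\mathbf{d},$$

where $\mathbf{S}_{\mathrm{pl}} = \sum_{ij}(\mathbf{y}_{ij} - \bar{\mathbf{y}}_i)(\mathbf{y}_{ij} - \bar{\mathbf{y}}_i)'/(n_1 + n_2 - 2)$, $\nu = n_1 + n_2 - 2$,
$k = n_1 n_2/(n_1 + n_2)$, and $\mathbf{d} = \bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2$. Use (2.77) for the inverse of a
patterned matrix of the type $\nu\mathbf{S}_{\mathrm{pl}} + k\mathbf{d}\mathbf{d}'$ to obtain (8.4).

8.3 You may want to use the following steps:

(i) $R^2$ is defined as [see (10.30)]

$$R^2 = \frac{\mathbf{b}'\mathbf{Y}_c'\mathbf{w} - n\bar{w}^2}{\mathbf{w}'\mathbf{w} - n\bar{w}^2}.$$

In this case the expression simplifies because $\bar{w} = 0$. Using $\mathbf{Y}_c'\mathbf{w}$ in Problem 8.2(iii), show that $R^2 = \mathbf{b}'(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)$.

(ii) Show that

$$\mathbf{b}'(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2) = \frac{T^2}{n_1 + n_2 - 2 + T^2}.$$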

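8.4 $[\mathbf{a}'(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)]^2 = \mathbf{a}'(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)\,\mathbf{a}'(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2) = \mathbf{a}'(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)'\mathbf{a}$

8.5 $\mathbf{H}\mathbf{a} - \lambda\mathbf{E}\mathbf{a} = \mathbf{0}$

$\mathbf{E}^{-1}(\mathbf{H}\mathbf{a} - \lambda\mathbf{E}\mathbf{a}) = \mathbf{E}^{-1}\mathbf{0}$
$\mathbf{E}^{-1}\mathbf{H}\mathbf{a} - \lambda\mathbf{E}^{-1}\mathbf{E}\mathbf{a} = \mathbf{0}$
$(\mathbf{E}^{-1}\mathbf{H} - \lambda\mathbf{I})\mathbf{a} = \mathbf{0}$

8.6 Substituting $a_r^* = s_r a_r$, $r = 1, 2, \ldots, p$, into (8.15), we obtain

$$z_{1i} = s_1 a_1\frac{y_{1i1} - \bar{y}_{11}}{s_1} + s_2 a_2\frac{y_{1i2} - \bar{y}_{12}}{s_2} + \cdots + s_p a_p\frac{y_{1ip} - \bar{y}_{1p}}{s_p}$$

$$= a_1 y_{1i1} + a_2 y_{1i2} + \cdots + a_p y_{1ip} - a_1\bar{y}_{11} - a_2\bar{y}_{12} - \cdots - a_p\bar{y}_{1p}$$

$$= a_1 y_{1i1} + a_2 y_{1i2} + \cdots + a_p y_{1ip} - \mathbf{a}'\bar{\mathbf{y}}_1$$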


8.7 (a) a*' = (1.366, -.810, 2.525, -1.463)

(b) t1 = 5.417, t2 = 2.007, t3 = 7.775, t4 = .688

(c) The standardized coefficients rank the variables in the order y3, y4, y1, y2.
The t-tests rank them in the order y3, y1, y2, y4.

(d) The partial F's calculated by (8.26) are F(y1|y2, y3, y4) = 7.844,
F(y2|y1, y3, y4) = 2.612, F(y3|y1, y2, y4) = 40.513, and
F(y4|y1, y2, y3) = 9.938.

8.8 (a) a' = (.345, -.130, -.106, -.143)

(b) a*' = (4.137, -2.501, -1.158, -2.068)

(c) t1 = 3.888, t2 = -3.865, t3 = -5.691, t4 = -5.043

(e) F(y1|y2, y3, y4) = 35.934, F(y2|y1, y3, y4) = 5.799,
F(y3|y1, y2, y4) = 1.775, F(y4|y1, y2, y3) = 8.259

8.9 (a) a' = (-.145, .052, -.005, -.089, -.007, -.022)

(b) a*' = (-1.016, .147, -.542, -1.035, -.107, -1.200)




(c) t1 = -4.655, t2 = .592, t3 = -4.354, t4 = -5.257, t5 = -4.032,
t6 = -6.439

(e) F(y1|y2, y3, y4, y5, y6) = 8.081, F(y2|y1, y3, y4, y5, y6) = .150,
F(y3|y1, y2, y4, y5, y6) = .835, F(y4|y1, y2, y3, y5, y6) = 8.503,
F(y5|y1, y2, y3, y4, y6) = .028, F(y6|y1, y2, y3, y4, y5) = 9.192

8.10 (a) a' = (.057, .010, .242, .071)

(b) a*' = (1.390, .083, 1.025, .032)

(c) t1 = -3.713, t2 = .549, t3 = -3.262, t4 = -.724

(e) F(y1|y2, y3, y4) = 3.332, F(y2|y1, y3, y4) = .010,
F(y3|y1, y2, y4) = 1.482, F(y4|y1, y2, y3) = .001

8.11 (a) a1' = (.021, .533, -.347, -.135), a2' = (-.317, .298, .243, -.026)

(b) λ1/(λ1 + λ2) = .958, λ2/(λ1 + λ2) = .042. Using the methods of Section 8.6.2, we have two tests, the first for significance of λ1 and λ2 and the
second for significance of λ2:

Test    Λ        F         p-Value for F
1       .2245    8.3294    <.0001
2       .8871    1.3157    .2869

(c) a1*' = (.076, 1.553, -1.182, -.439), a2*' = (-1.162, .869, .828, -.085)

(d) F(y1|y2, y3, y4) = 1.067, F(y2|y1, y3, y4) = 20.975,
F(y3|y1, y2, y4) = 9.630, F(y4|y1, y2, y3) = 1.228

(e) In the plot, the first discriminant function separates groups 1 and 2 from
group 3, but the second is ineffective in separating group 1 from group 2.


8.12 (a)

λi        λi/Σj λj    Eigenvector
1.8757    .6421       a1' = (.470, -.263, .653, -.074)
.7907     .2707       a2' = (.176, .188, -1.058, 1.778)
.2290     .0784       a3' = (-.155, .258, .470, -.850)
.0260     .0089       a4' = (-3.614, .475, .310, -.479)

(b) Test of significance of each eigenvalue and those that follow it:

Test    Λ        Approximate F    p-Value for F
1       .1540    4.937            <.0001
2       .4429    3.188            .0006
3       .7931    1.680            .1363
4       .9747    .545             .5839

(c) a1*' = (.266, -.915, 1.353, -.097), a2*' = (.100, .654, -2.291, 2.333),
a3*' = (-.087, .899, .973, -1.115), a4*' = (-2.044, 1.654, .643, -.628)

(d) F(y1|y2, y3, y4) = .299, F(y2|y1, y3, y4) = 1.931,
F(y3|y1, y2, y4) = 6.085, F(y4|y1, y2, y3) = 4.659

(e) In the plot, the first discriminant function separates groups 1, 4, and 6
from groups 2, 3, and 5. The second function achieves some separation of
group 6 from groups 1 and 4 and some separation of group 3 from groups
2 and 5.

8.13 Three variables entered the model in the stepwise selection. The summary table
is as follows:

Step    Variable Entered    Overall Λ    p-Value    Partial Λ    Partial F    p-Value
1       y4                  .4086        <.0001     .4086        12.158       <.0001
2       ys                  .2655        <.0001     .6499        4.418        .0026
3       y2                  .1599        <.0001     .6022        5.284        .0008

8.14 Summary table:

Step    Variable Entered    Overall Λ    p-Value    Partial Λ    Partial F    p-Value
1       y4                  .6392        <.0001     .6392        21.451       <.0001
2       y3                  .5430        <.0001     .8495        6.554        .0147
3       y6                  .4594        <.0001     .8461        6.549        .0148
4       y2                  .4063        <.0001     .8843        4.578        .0394
5       y5                  .3639        <.0001     .8957        3.959        .0547

In this case, the fifth variable to enter, y5, would not ordinarily be included
in the subset. The p-value of .0547 is large in this setting, where several tests
are run at each step and the variable with smallest p-value is selected.
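
The Overall Λ and Partial Λ columns of the stepwise summary tables are linked: each partial Λ equals the overall Λ at that step divided by the overall Λ at the previous step (taken as 1 before any variable enters). The sketch below checks this against the five overall values reported for Problem 8.14; the function name is illustrative, and the relation is stated here as an assumption that the printed figures happen to satisfy.

```python
def partial_lambdas(overall):
    """Partial Lambda at each step as the ratio of successive overall Lambdas."""
    partials = []
    previous = 1.0  # no variables in the model yet
    for current in overall:
        partials.append(current / previous)
        previous = current
    return partials

# Overall Lambda values reported in the Problem 8.14 summary table.
overall_8_14 = [0.6392, 0.5430, 0.4594, 0.4063, 0.3639]
partials_8_14 = partial_lambdas(overall_8_14)

for step, value in enumerate(partials_8_14, start=1):
    print(step, round(value, 4))
```

The recomputed ratios match the printed Partial Λ column to within rounding of the four-decimal inputs.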


8.15 Summary table:

Step    Variable Entered    Overall Λ    p-Value    Partial Λ    Partial F    p-Value
1       y2                  .6347        .0006      .6347        9.495        .0006
2       ys                  .2606        <.0001     .4106        22.975       <.0001

CHAPTER  9 

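9.1 $\bar{z}_1 - \bar{z}_2 = \mathbf{a}'\bar{\mathbf{y}}_1 - \mathbf{a}'\bar{\mathbf{y}}_2 = \mathbf{a}'(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2) = (\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)'\mathbf{S}_{\mathrm{pl}}^{-1}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)$

9.2 $\tfrac{1}{2}(\bar{z}_1 + \bar{z}_2) = \tfrac{1}{2}(\mathbf{a}'\bar{\mathbf{y}}_1 + \mathbf{a}'\bar{\mathbf{y}}_2) = \tfrac{1}{2}\mathbf{a}'(\bar{\mathbf{y}}_1 + \bar{\mathbf{y}}_2) = \tfrac{1}{2}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)'\mathbf{S}_{\mathrm{pl}}^{-1}(\bar{\mathbf{y}}_1 + \bar{\mathbf{y}}_2)$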

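9.3 Write (9.7) in the form

$$\frac{f(\mathbf{y}|G_1)}{f(\mathbf{y}|G_2)} > \frac{p_2}{p_1}$$

and substitute $f(\mathbf{y}|G_i) = N_p(\boldsymbol{\mu}_i, \boldsymbol{\Sigma})$ from (4.2) to obtain

$$\frac{f(\mathbf{y}|G_1)}{f(\mathbf{y}|G_2)} = \exp\left[(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}^{-1}\mathbf{y} - \tfrac{1}{2}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)'\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2)\right] > \frac{p_2}{p_1}.$$

Substitute estimates for $\boldsymbol{\mu}_1$, $\boldsymbol{\mu}_2$, and $\boldsymbol{\Sigma}$, and take the logarithm of both sides to
obtain (9.8). Note that if $a > b$, then $\ln a > \ln b$.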

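9.4 Maximizing $p_i f(\mathbf{y}|G_i)$ is equivalent to maximizing $\ln[p_i f(\mathbf{y}|G_i)]$. Use
$f(\mathbf{y}|G_i) = N_p(\boldsymbol{\mu}_i, \boldsymbol{\Sigma})$ from (4.2) and take the logarithm to obtain

$$\ln[p_i f(\mathbf{y}|G_i)] = \ln p_i - \tfrac{1}{2}p\ln(2\pi) - \tfrac{1}{2}\ln|\boldsymbol{\Sigma}| - \tfrac{1}{2}(\mathbf{y} - \boldsymbol{\mu}_i)'\boldsymbol{\Sigma}^{-1}(\mathbf{y} - \boldsymbol{\mu}_i).$$

Expand the last term, delete terms common to all groups (terms that do not
involve $i$), and substitute estimators of $\boldsymbol{\mu}_i$ and $\boldsymbol{\Sigma}$ to obtain (9.11).

9.5 Use $f(\mathbf{y}|G_i) = N_p(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$ in $\ln[p_i f(\mathbf{y}|G_i)]$, delete $-(p/2)\ln(2\pi)$, and substitute $\bar{\mathbf{y}}_i$ and $\mathbf{S}_i$ for $\boldsymbol{\mu}_i$ and $\boldsymbol{\Sigma}_i$.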


9.6 (a) $\mathbf{a}' = (\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)'\mathbf{S}_{\mathrm{pl}}^{-1}$ = (.345, -.130, -.106, -.143),
$\tfrac{1}{2}(\bar{z}_1 + \bar{z}_2)$ = -15.8054

(b) Classification table (number of observations by actual and predicted group):
Error rate = 1/39 = .0256

(c) Using the k nearest neighbor method with k = 5, we obtain the same
classification table as in part (b). With k = 4, two observations are misclassified, and the error rate becomes 2/39 = .0513.

9.7 (a) $\mathbf{a}' = (\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)'\mathbf{S}_{\mathrm{pl}}^{-1}$ = (-.145, .052, -.005, -.089, -.007, -.022),
$\tfrac{1}{2}(\bar{z}_1 + \bar{z}_2)$ = -17.045

(b) Linear Classification (number of observations by actual and predicted group):
Error rate = (2 + 8)/73 = .1370

(c) p1 and p2 Proportional to Sample Sizes (number of observations by actual and predicted group):
Error rate = (2 + 8)/73 = .1370




9.8 (a) $\mathbf{a}' = (\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)'\mathbf{S}_{\mathrm{pl}}^{-1}$ = (-.057, -.010, -.242, -.071),
$\tfrac{1}{2}(\bar{z}_1 + \bar{z}_2)$ = -7.9686

(b) Linear Classification

Actual    Number of       Predicted Group
Group     Observations    1    2
1         9               8    1
2         10              1    9

Error rate = 2/19 = .1053

(c) Holdout Method

Actual    Number of       Predicted Group
Group     Observations    1    2
1         9               6    3
2         10              3    7

Error rate = (3 + 3)/19 = .3158

(d) Kernel Density Estimator with h = 2

Actual    Number of       Predicted Group
Group     Observations    1    2
1         9               9    0
2         10              1    9

Error rate = 1/19 = .0526

9.9 (a)

Actual    Number of       Predicted Group
Group     Observations    1     2
1         20              18    2
2         20              2     18

Error rate = (2 + 2)/40 = .100

(b) Four variables were selected by the stepwise discriminant analysis: y2, y3,
y4, and y6 (see Problem 8.14). With these four variables we obtain the
classification table in part (c).

(c)

Actual    Number of       Predicted Group
Group     Observations    1     2
1         20              18    2
2         20              2     18

Error rate = (2 + 2)/40 = .100. The four variables classified the sample as well as did all six variables in part (a).


9.10 (a) By (9.10), $L_i(\mathbf{y}) = \bar{\mathbf{y}}_i'\mathbf{S}_{\mathrm{pl}}^{-1}\mathbf{y} - \tfrac{1}{2}\bar{\mathbf{y}}_i'\mathbf{S}_{\mathrm{pl}}^{-1}\bar{\mathbf{y}}_i = \mathbf{c}_i'\mathbf{y} + c_{0i}$. The vectors $\binom{c_{0i}}{\mathbf{c}_i}$,
i = 1, 2, 3, are

Group 1    Group 2    Group 3
-72.77     -65.18     -68.57
.81        2.12       .68
15.15      10.11      2.79
-1.03      -.24       6.54
10.02      11.06      13.09

(b) Linear Classification

Actual    Number of       Predicted Group
Group     Observations    1    2    3
1         12              9    3    0
2         12              3    7    2
3         12              0    1    11

Error rate = (3 + 3 + 2 + 1)/36 = .250

(c) Quadratic Classification

Actual    Number of       Predicted Group
Group     Observations    1     2    3
1         12              10    2    0
2         12              2     8    2
3         12              0     1    11

Error rate = (2 + 2 + 2 + 1)/36 = .194

(d) Linear Classification-Holdout Method

Actual    Number of       Predicted Group
Group     Observations    1    2    3
1         12              7    5    0
2         12              4    5    3
3         12              0    1    11

Error rate = (5 + 4 + 3 + 1)/36 = .361

(e) k Nearest Neighbor with k = 5

Actual    Number of       Predicted Group
Group     Observations    1    2    3
1         11              9    2    0
2         11              2    7    2
3         12              0    1    11

Error rate = (2 + 2 + 2 + 1)/34 = .206

9.11 (a) By (9.10), $L_i(\mathbf{y}) = \bar{\mathbf{y}}_i'\mathbf{S}_{\mathrm{pl}}^{-1}\mathbf{y} - \tfrac{1}{2}\bar{\mathbf{y}}_i'\mathbf{S}_{\mathrm{pl}}^{-1}\bar{\mathbf{y}}_i = \mathbf{c}_i'\mathbf{y} + c_{0i}$. The vectors $\binom{c_{0i}}{\mathbf{c}_i}$,
i = 1, 2, ..., 6, are

Group 1    Group 2    Group 3    Group 4    Group 5    Group 6
-300.0     -353.2     -328.5     -291.8     -347.5     -315.8
314.6      317.1      324.6      307.3      316.8      311.3
-59.4      -64.0      -65.2      -59.4      -65.8      -63.1
149.6      168.2      154.9      147.7      168.2      160.6
-161.2     -172.6     -150.4     -153.4     -172.9     -175.5

(b) Linear Classification

Actual    Number of       Predicted Group
Group     Observations    1    2    3    4    5    6
1         8               5    0    0    1    0    2
2         8               0    3    2    1    2    0
3         8               0    0    6    1    1    0
4         8               3    0    1    4    0    0
5         8               0    3    1    0    3    1
6         8               2    0    0    0    2    4

Correct classification rate = (5 + 3 + 6 + 4 + 3 + 4)/48 = .521
Error rate = 1 - .521 = .479

(c) Quadratic Classification

Actual    Number of       Predicted Group
Group     Observations    1    2    3    4    5    6
1         8               8    0    0    0    0    0
2         8               0    7    0    1    0    0
3         8               1    0    6    0    1    0
4         8               0    0    1    7    0    0
5         8               0    3    0    0    4    1
6         8               2    0    0    0    1    5

Correct classification rate = (8 + 7 + 6 + 7 + 4 + 5)/48 = .771
Error rate = 1 - .771 = .229

(d) k Nearest Neighbor with k = 3

Actual    Number of       Predicted Group
Group     Observations    1    2    3    4    5    6    Ties
1         8               5    0    0    2    0    0    1
2         8               0    4    0    0    1    0    3
3         8               1    0    6    0    1    0    0
4         8               0    0    0    5    0    0    3
5         8               0    1    0    0    6    1    0
6         8               2    0    0    0    0    5    1

Correct classification rate = (5 + 4 + 6 + 5 + 6 + 5)/40 = .775
Error rate = 1 - .775 = .225

(e) Normal Kernel with h = 1
(For this data set, larger values of h do much worse.)

Actual    Number of       Predicted Group
Group     Observations    1    2    3    4    5    6
1         8               8    0    0    0    0    0
2         8               0    8    0    0    0    0
3         8               1    0    6    0    1    0
4         8               1    0    0    7    0    0
5         8               0    0    0    0    7    1
6         8               2    0    0    0    0    6

Correct classification rate = (8 + 8 + 6 + 7 + 7 + 6)/48 = .875
Error rate = 1 - .875 = .125
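
The correct classification and error rates quoted for each table in Problems 9.8 to 9.11 follow from the diagonal and total counts. The sketch below recomputes them for the linear classification counts of Problem 9.11; the row layout (rows are actual groups, columns are predicted groups) is an assumption about how the printed table is read, and the function name is illustrative.

```python
def classification_rates(table):
    """Return (correct rate, error rate) for a square table of counts.

    Rows are assumed to be actual groups and columns predicted groups.
    """
    correct = sum(table[i][i] for i in range(len(table)))
    total = sum(sum(row) for row in table)
    rate = correct / total
    return rate, 1.0 - rate

# Linear classification counts for Problem 9.11 (six groups of eight).
linear_9_11 = [
    [5, 0, 0, 1, 0, 2],
    [0, 3, 2, 1, 2, 0],
    [0, 0, 6, 1, 1, 0],
    [3, 0, 1, 4, 0, 0],
    [0, 3, 1, 0, 3, 1],
    [2, 0, 0, 0, 2, 4],
]

correct_rate, error_rate = classification_rates(linear_9_11)
print(round(correct_rate, 3), round(error_rate, 3))
```

For the k nearest neighbor table, tied observations are excluded from the denominator in the printed answer, so the ties column would be left out of `table` before calling the function.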


CHAPTER  10 


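10.1 By (2.33), $\sum_{i=1}^{n}(y_i - \mathbf{x}_i'\boldsymbol{\beta})^2 = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$.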


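10.2 $\sum_{i=1}^{n}(y_i - \mu)^2 = \sum_{i=1}^{n}(y_i - \bar{y} + \bar{y} - \mu)^2$

$= \sum_{i=1}^{n}(y_i - \bar{y})^2 + 2\sum_{i=1}^{n}(y_i - \bar{y})(\bar{y} - \mu) + \sum_{i=1}^{n}(\bar{y} - \mu)^2$
$= \sum_{i=1}^{n}(y_i - \bar{y})^2 + 2(\bar{y} - \mu)\sum_{i=1}^{n}(y_i - \bar{y}) + n(\bar{y} - \mu)^2$
$= \sum_{i=1}^{n}(y_i - \bar{y})^2 + n(\bar{y} - \mu)^2$ [since $\sum_{i=1}^{n}(y_i - \bar{y}) = 0$]

10.3 $\sum_{i=1}^{n}(x_{i2} - \bar{x}_2)\bar{y} = \bar{y}\sum_{i=1}^{n}(x_{i2} - \bar{x}_2) = \bar{y}\left(\sum_{i=1}^{n}x_{i2} - n\bar{x}_2\right) = \bar{y}(n\bar{x}_2 - n\bar{x}_2) = 0$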
10.4 $E[\hat{y}_i - E(y_i)]^2 = E[\hat{y}_i - E(\hat{y}_i) + E(\hat{y}_i) - E(y_i)]^2$

$= E[\hat{y}_i - E(\hat{y}_i)]^2 + 2E[\hat{y}_i - E(\hat{y}_i)][E(\hat{y}_i) - E(y_i)] + E[E(\hat{y}_i) - E(y_i)]^2$

The second term on the right vanishes because $[E(\hat{y}_i) - E(y_i)]$ is constant and
$E[\hat{y}_i - E(\hat{y}_i)] = E(\hat{y}_i) - E(\hat{y}_i) = 0$. For the third term, we have $E[E(\hat{y}_i) - E(y_i)]^2 = [E(\hat{y}_i) - E(y_i)]^2$, because $[E(\hat{y}_i) - E(y_i)]^2$ is constant.

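10.5 First show that $\mathrm{cov}(\hat{\boldsymbol{\beta}}_p) = \sigma^2(\mathbf{X}_p'\mathbf{X}_p)^{-1}$. This can be done by noting that
$\hat{\boldsymbol{\beta}}_p = (\mathbf{X}_p'\mathbf{X}_p)^{-1}\mathbf{X}_p'\mathbf{y} = \mathbf{A}\mathbf{y}$, say. Then, by (3.74), $\mathrm{cov}(\mathbf{A}\mathbf{y}) = \mathbf{A}\,\mathrm{cov}(\mathbf{y})\mathbf{A}' = \mathbf{A}(\sigma^2\mathbf{I})\mathbf{A}' = \sigma^2\mathbf{A}\mathbf{A}'$. By substituting $\mathbf{A} = (\mathbf{X}_p'\mathbf{X}_p)^{-1}\mathbf{X}_p'$, this becomes
$\mathrm{cov}(\hat{\boldsymbol{\beta}}_p) = \sigma^2(\mathbf{X}_p'\mathbf{X}_p)^{-1}$. Then, by (3.70), $\mathrm{var}(\mathbf{x}_{pi}'\hat{\boldsymbol{\beta}}_p) = \mathbf{x}_{pi}'\,\mathrm{cov}(\hat{\boldsymbol{\beta}}_p)\mathbf{x}_{pi}$
and the remaining steps follow as indicated.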

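10.6 By (10.36), $s_p^2 = \mathrm{SSE}_p/(n - p)$. Then by (10.44),

$$C_p = p + (n - p)\frac{s_p^2 - s_k^2}{s_k^2} = p + (n - p)\left(\frac{s_p^2}{s_k^2} - 1\right)$$

$$= p + (n - p)\frac{s_p^2}{s_k^2} - (n - p) = (n - p)\frac{\mathrm{SSE}_p/s_k^2}{n - p} - n + 2p$$

$$= \frac{\mathrm{SSE}_p}{s_k^2} - (n - 2p).$$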
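
The algebra in Problem 10.6 can be spot-checked numerically: both forms of $C_p$ should give the same number for any inputs. The sketch below uses made-up values of n, p, SSE_p, and s_k² purely for illustration (they are not from any data set in the text), and the function names are hypothetical.

```python
def cp_from_variances(n, p, s2_p, s2_k):
    """C_p written as p + (n - p)(s_p^2 - s_k^2)/s_k^2."""
    return p + (n - p) * (s2_p - s2_k) / s2_k

def cp_from_sse(n, p, sse_p, s2_k):
    """C_p written as SSE_p/s_k^2 - (n - 2p)."""
    return sse_p / s2_k - (n - 2 * p)

# Hypothetical inputs chosen only to exercise the identity.
n, p = 30, 4
sse_p, s2_k = 52.0, 1.6
s2_p = sse_p / (n - p)  # s_p^2 = SSE_p/(n - p)

cp_a = cp_from_variances(n, p, s2_p, s2_k)
cp_b = cp_from_sse(n, p, sse_p, s2_k)
print(cp_a, cp_b)
```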


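10.7 $(\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}})'(\mathbf{Y} - \mathbf{X}\hat{\mathbf{B}}) = \mathbf{Y}'\mathbf{Y} - \mathbf{Y}'\mathbf{X}\hat{\mathbf{B}} - \hat{\mathbf{B}}'\mathbf{X}'\mathbf{Y} + \hat{\mathbf{B}}'\mathbf{X}'\mathbf{X}\hat{\mathbf{B}}$. Transpose $\hat{\mathbf{B}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$ from (10.46) and substitute into $\hat{\mathbf{B}}'\mathbf{X}'\mathbf{X}\hat{\mathbf{B}}$.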

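10.8 $E[\hat{\mathbf{y}}_i - E(\mathbf{y}_i)][\hat{\mathbf{y}}_i - E(\mathbf{y}_i)]' = E[\hat{\mathbf{y}}_i - E(\hat{\mathbf{y}}_i) + E(\hat{\mathbf{y}}_i) - E(\mathbf{y}_i)][\hat{\mathbf{y}}_i - E(\hat{\mathbf{y}}_i) + E(\hat{\mathbf{y}}_i) - E(\mathbf{y}_i)]'$

$= E[\hat{\mathbf{y}}_i - E(\hat{\mathbf{y}}_i)][\hat{\mathbf{y}}_i - E(\hat{\mathbf{y}}_i)]'$

$+ E[\hat{\mathbf{y}}_i - E(\hat{\mathbf{y}}_i)][E(\hat{\mathbf{y}}_i) - E(\mathbf{y}_i)]'$

$+ E[E(\hat{\mathbf{y}}_i) - E(\mathbf{y}_i)][\hat{\mathbf{y}}_i - E(\hat{\mathbf{y}}_i)]'$

$+ E[E(\hat{\mathbf{y}}_i) - E(\mathbf{y}_i)][E(\hat{\mathbf{y}}_i) - E(\mathbf{y}_i)]'$

The second and third terms are equal to $\mathbf{O}$ because $[E(\hat{\mathbf{y}}_i) - E(\mathbf{y}_i)]$ is a
constant vector and $E[\hat{\mathbf{y}}_i - E(\hat{\mathbf{y}}_i)] = E(\hat{\mathbf{y}}_i) - E(\hat{\mathbf{y}}_i) = \mathbf{0}$. The fourth term is a
constant matrix and the first $E$ can be deleted.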

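10.9 As in Problem 10.5, we have $\mathrm{cov}(\hat{\boldsymbol{\beta}}_{p(j)}) = \sigma_{jj}(\mathbf{X}_p'\mathbf{X}_p)^{-1}$, where $\sigma_{jj} = \mathrm{var}(y_j)$ is the $j$th diagonal element of $\boldsymbol{\Sigma} = \mathrm{cov}(\mathbf{y})$. Similarly, $\mathrm{cov}(\hat{\boldsymbol{\beta}}_{p(j)}, \hat{\boldsymbol{\beta}}_{p(k)}) = \sigma_{jk}(\mathbf{X}_p'\mathbf{X}_p)^{-1}$, where $\sigma_{jk} = \mathrm{cov}(y_j, y_k)$ is the $(jk)$th element of $\boldsymbol{\Sigma}$.
The notation $\mathrm{cov}(\hat{\boldsymbol{\beta}}_{p(j)}, \hat{\boldsymbol{\beta}}_{p(k)})$ indicates a matrix containing the covariance
of each element of $\hat{\boldsymbol{\beta}}_{p(j)}$ and each element of $\hat{\boldsymbol{\beta}}_{p(k)}$. Now for the covariance
matrix, $\mathrm{cov}(\hat{\mathbf{y}}_i') = \mathrm{cov}(\mathbf{x}_{pi}'\hat{\boldsymbol{\beta}}_{p(1)}, \ldots, \mathbf{x}_{pi}'\hat{\boldsymbol{\beta}}_{p(m)})$, we need the variance of each
of the $m$ random variables and the covariance of each pair. By Problem 10.5
and (3.70), $\mathrm{var}(\mathbf{x}_{pi}'\hat{\boldsymbol{\beta}}_{p(1)}) = \mathbf{x}_{pi}'\,\mathrm{cov}(\hat{\boldsymbol{\beta}}_{p(1)})\mathbf{x}_{pi} = \sigma_{11}\mathbf{x}_{pi}'(\mathbf{X}_p'\mathbf{X}_p)^{-1}\mathbf{x}_{pi}$. Similarly, $\mathrm{cov}(\mathbf{x}_{pi}'\hat{\boldsymbol{\beta}}_{p(1)}, \mathbf{x}_{pi}'\hat{\boldsymbol{\beta}}_{p(2)}) = \sigma_{12}\mathbf{x}_{pi}'(\mathbf{X}_p'\mathbf{X}_p)^{-1}\mathbf{x}_{pi}$. The other variances
and covariances can be obtained in an analogous manner.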

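10.10 By (10.77), $\mathbf{S}_p = \mathbf{E}_p/(n - p)$. Then by (10.83),

$$\mathbf{C}_p = p\mathbf{I} + (n - p)\mathbf{S}_k^{-1}(\mathbf{S}_p - \mathbf{S}_k) = p\mathbf{I} + (n - p)\mathbf{S}_k^{-1}\frac{\mathbf{E}_p}{n - p} - (n - p)\mathbf{I} = \mathbf{S}_k^{-1}\mathbf{E}_p + (2p - n)\mathbf{I}.$$

10.11 $|\mathbf{E}_k^{-1}\mathbf{E}_p| = |\mathbf{E}_k^{-1}||\mathbf{E}_p| > 0$, because both $\mathbf{E}_k^{-1}$ and $\mathbf{E}_p$ are positive definite.

10.12 By (10.84), $\mathbf{C}_p = \mathbf{S}_k^{-1}\mathbf{E}_p + (2p - n)\mathbf{I}$. Using $\mathbf{S}_k = \mathbf{E}_k/(n - k)$, we obtain

$$(n - k)\mathbf{E}_k^{-1}\mathbf{E}_p = \mathbf{C}_p + (n - 2p)\mathbf{I}.$$

10.13 If $\mathbf{C}_p$ is replaced by $p\mathbf{I}$ in (10.86), we obtain

$$\mathbf{E}_k^{-1}\mathbf{E}_p = \frac{\mathbf{C}_p + (n - 2p)\mathbf{I}}{n - k} = \frac{p\mathbf{I} + n\mathbf{I} - 2p\mathbf{I}}{n - k} = \frac{(n - p)\mathbf{I}}{n - k}.$$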

10.14 (a)
$$\hat{\mathbf{B}} = \begin{pmatrix} .6264 & 83.243 \\ .0009 & .029 \\ -.0010 & -.013 \\ .0015 & -.004 \end{pmatrix}$$

(b) Λ = .724, V(s) = .280, U(s) = .375, θ = .264

(c) λ1 = .3594, λ2 = .0160. The essential rank of $\hat{\mathbf{B}}_1$ is 1, and the power
ranking is θ > U(s) > Λ > V(s).

(d) The Wilks' Λ test of x2 adjusted for x1 and x3, for example, is given by
(10.65) as

$$\Lambda(x_2|x_1, x_3) = \frac{\Lambda(x_1, x_2, x_3)}{\Lambda(x_1, x_3)},$$

which is distributed as $\Lambda_{p,1,n-4}$ and has an exact F-transformation. The
tests for x1 and x3 are similar. For the three tests we obtain the following:

               Λ       F        p-Value
x1|x2, x3      .931    1.519    .231
x2|x1, x3      .887    2.606    .086
x3|x1, x2      .762    6.417    .004
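
The four overall test statistics in part (b) are all functions of the eigenvalues listed in part (c), so either set of numbers can be checked against the other. The sketch below assumes the usual definitions (Wilks' Λ as the product of 1/(1 + λi), Pillai's V(s) as the sum of λi/(1 + λi), the Lawley-Hotelling U(s) as the sum of the λi, and Roy's θ as λ1/(1 + λ1)); the function name is illustrative.

```python
from math import prod

def manova_statistics(eigenvalues):
    """Return (Wilks, Pillai, Lawley-Hotelling, Roy) from the eigenvalues."""
    lam = list(eigenvalues)
    wilks = prod(1.0 / (1.0 + value) for value in lam)
    pillai = sum(value / (1.0 + value) for value in lam)
    lawley_hotelling = sum(lam)
    largest = max(lam)
    roy = largest / (1.0 + largest)
    return wilks, pillai, lawley_hotelling, roy

# Eigenvalues reported in Problem 10.14(c).
wilks, pillai, lawley_hotelling, roy = manova_statistics([0.3594, 0.0160])
print(round(wilks, 3), round(pillai, 3), round(lawley_hotelling, 3), round(roy, 3))
```

With these two eigenvalues the results agree with the printed values in part (b) to three decimals, up to rounding in the last digit.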

10.15 (a)
$$\hat{\mathbf{B}} = \begin{pmatrix} 34.282 & 35.802 \\ .394 & .245 \\ .529 & .471 \end{pmatrix}$$

(b) Λ = .377, V(s) = .625, U(s) = 1.647, θ = .622

(c) λ1 = 1.644, λ2 = .0029. The essential rank of $\hat{\mathbf{B}}_1$ is 1, and the power
ranking is θ > U(s) > Λ > V(s).

(d)
           Λ       F        p-Value
x1|x2      .888    1.327    .287
x2|x1      .875    1.506    .245

10.16 (a)
$$\hat{\mathbf{B}} = \begin{pmatrix} 54.870 & 65.679 & 58.106 \\ .054 & -.048 & .018 \\ -.024 & .163 & .012 \\ .107 & -.036 & .125 \end{pmatrix}$$

(b) Λ = .665, V(s) = .365, U(s) = .458, θ = .240

(c) λ1 = .3159, λ2 = .1385, λ3 = .0037. The essential rank of $\hat{\mathbf{B}}_1$ is 2, and
the power ranking is V(s) > Λ > U(s) > θ.

(d)
               Λ       F        p-Value
x1|x2, x3      .942    .903     .447
x2|x1, x3      .847    2.653    .060
x3|x1, x2      .829    3.020    .040

(e)
               Λ       F        p-Value
y1|y2, y3      .890    1.804    .160
y2|y1, y3      .833    2.932    .044
y3|y1, y2      .872    2.159    .106

10.17 (a)
$$\hat{\mathbf{B}} = \begin{pmatrix} -4.140 & 4.935 \\ 1.103 & -.955 \\ .231 & -.222 \\ 1.171 & 1.773 \\ .111 & .048 \\ .617 & -.058 \\ .267 & .485 \\ -.263 & -.209 \\ -.004 & -.004 \end{pmatrix}$$

Test of overall regression of (y1, y2) on (x1, ..., x8): Λ = .4642 (with
p = 2, exact F = 1.169, p-value = .332). Tests on subsets (the F's are
exact because p = 2):

                                      Λ       F        p-Value
(b) x7, x8|x1, x2, ..., x6            .856    .808     .527
(c) x4, x5, x6|x1, x2, x3, x7, x8     .674    1.457    .218
(d) x1, x2, x3|x4, x5, ..., x8        .569    2.170    .066

10.18 (a) The overall test of (y1, y2) on (x1, x2, ..., x8) gives Λ = .4642, with
(exact) F = 1.169 (p-value = .332). Even though this test result is not
significant, we give the results of a backward elimination for illustrative
purposes:

        Partial Λ-Test on Each xi Using (10.72)
Step    x1      x2      x3      x4      x5      x6      x7      x8
1       .723    .969    .817    .859    .821    .945    .924    .943
2       .741            .801    .851    .839    .948    .927    .940
3       .737            .837    .798    .757            .949    .938
4       .675            .852    .821    .794                    .925
5       .680            .861    .835    .817
6       .701                    .805    .806
7       .855                    .930
8       .891

At each step, the variable deleted was not significant. In fact, the variable
remaining at the last step, x1, is not a significant predictor of y1 and y2.

(b) There were no significant x's, but to illustrate, we will use the three x's at
step 6 and test for each y:

           Λ       F        p-Value
y1|y2      .701    3.548
y2|y1      .808    1.984

10.19 (a)
$$\hat{\mathbf{B}} = \begin{pmatrix} 43.703 & 46.793 & 187.923 \\ .019 & -.098 & 1.016 \\ .139 & .185 & -4.953 \\ .204 & .107 & 1.606 \end{pmatrix}$$

Λ = .167, V(s) = .883, U(s) = 4.709,

(b)
$$\hat{\mathbf{B}} = \begin{pmatrix} 99.817 & -29.120 & 121.595 \\ -.008 & -.224 & -.027 \\ .097 & 1.252 & 5.775 \\ -.049 & -.442 & -1.768 \\ -.022 & -.631 & -.488 \\ -.159 & 2.128 & 4.387 \\ .054 & -.037 & -.476 \end{pmatrix}$$

Λ = .110, V(s) = 1.350, U(s) = 4.319

.029 

.142 
$$\hat{\mathbf{B}} = \begin{pmatrix} 710.236 & 123.403 \\ -1.625 & .055 \\ 24.648 & .094 \\ -8.622 & -.334 \\ -8.224 & .462 \\ 23.626 & -.110 \\ 2.862 & .427 \\ -16.186 & -.267 \\ -.268 & .014 \\ -1.160 & -.336 \end{pmatrix}$$

Λ = .102, V(s) = 1.236, U(s) = 5.475, θ = .827

10.20 Using a backward elimination based on (10.72), we obtain the following partial
Λ-values:

Step    x1      x2      x3      x4      x5      x6      x7      x8      x9
1       .993    .962    .916    .958    .919    .879    .981    .999    .797
2       .994    .962    .916    .956    .909    .874    .980            .626
3               .951    .883    .954    .912    .873    .981            .626
4               .948    .884    .955    .861    .867                    .561
5               .953    .862            .840    .803                    .561
6                       .830            .781    .783                    .535

At step 6, we stop and retain all four x's because each Λ has a p-value less
than .05.


CHAPTER  11 

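11.1 By (3.38), $\mathbf{S}_{yy} = \mathbf{D}_y\mathbf{R}_{yy}\mathbf{D}_y$ and $\mathbf{S}_{xx} = \mathbf{D}_x\mathbf{R}_{xx}\mathbf{D}_x$, where $\mathbf{D}_y$ and $\mathbf{D}_x$ are
defined below (11.14). Similarly, $\mathbf{S}_{yx} = \mathbf{D}_y\mathbf{R}_{yx}\mathbf{D}_x$ and $\mathbf{S}_{xy} = \mathbf{D}_x\mathbf{R}_{xy}\mathbf{D}_y$.
Substitute these into (11.7), replace $\mathbf{I}$ by $\mathbf{D}_y^{-1}\mathbf{D}_y$, and factor out $\mathbf{D}_y$ on the
right.

11.2 Multiply (11.7) by $\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}$ on the left to obtain $(\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}\mathbf{S}_{yy}^{-1}\mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy} - r^2\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy})\mathbf{a} = \mathbf{0}$. Factor out $\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}$ on the right to write this in the form
$(\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}\mathbf{S}_{yy}^{-1}\mathbf{S}_{yx} - r^2\mathbf{I})\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}\mathbf{a} = \mathbf{0}$. Upon comparing this to (11.8), we see
that $\mathbf{b} = \mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}\mathbf{a}$.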

11.3 When p = 1, s is also 1, and there is only one squared canonical correlation, which
is equal to $R^2$ from multiple regression [see comments between (11.28) and
(11.29)]. Thus

$$\Lambda = \frac{1 - R_f^2}{1 - R_r^2}.$$

11.4
$$\frac{(1 - \Lambda)(n - q - 1)}{\Lambda h} = \frac{[1 - (1 - R_f^2)/(1 - R_r^2)](n - q - 1)}{[(1 - R_f^2)/(1 - R_r^2)]h} = \frac{[1 - R_r^2 - (1 - R_f^2)](n - q - 1)}{(1 - R_f^2)h} = \frac{(R_f^2 - R_r^2)(n - q - 1)}{(1 - R_f^2)h}$$

11.5 By (11.39),

$$r_i^2 = \frac{\lambda_i}{1 + \lambda_i},$$

$$r_i^2 + r_i^2\lambda_i = \lambda_i,$$

$$\lambda_i(1 - r_i^2) = r_i^2.$$
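
The last line of Problem 11.5 gives λi = ri²/(1 − ri²), which can be tried on numbers from these answers. The sketch below applies it to canonical correlations .5142 and .1255 reported later in this chapter's answers and compares the result with the eigenvalues .3594 and .0160 reported in Problem 10.14(c); that the two answers refer to the same data set is an assumption based on the matching Λ = .724, and the function names are illustrative.

```python
def eigenvalue_from_canonical_corr(r):
    """lambda = r^2 / (1 - r^2), the rearranged form of r^2 = lambda/(1 + lambda)."""
    r_squared = r * r
    return r_squared / (1.0 - r_squared)

def canonical_corr_sq_from_eigenvalue(lam):
    """r^2 = lambda / (1 + lambda)."""
    return lam / (1.0 + lam)

canonical_corrs = [0.5142, 0.1255]
eigenvalues = [eigenvalue_from_canonical_corr(r) for r in canonical_corrs]
round_trip = [canonical_corr_sq_from_eigenvalue(lam) ** 0.5 for lam in eigenvalues]

print([round(lam, 4) for lam in eigenvalues])
```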


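11.6 Substitute $\mathbf{E} = (n - 1)(\mathbf{S}_{yy} - \mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy})$ and $\mathbf{H} = (n - 1)\mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}$ from
(11.44) and (11.45) into (11.41):

$$\mathbf{H}\mathbf{a} = \lambda\mathbf{E}\mathbf{a},$$

$$(n - 1)\mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}\mathbf{a} = (n - 1)\lambda(\mathbf{S}_{yy} - \mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy})\mathbf{a},$$

$$\mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}\mathbf{a} = \lambda(\mathbf{S}_{yy} - \mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy})\mathbf{a}.$$

11.7 By (11.42), $\mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}\mathbf{a} = r^2\mathbf{S}_{yy}\mathbf{a}$. Subtracting $r^2\mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}\mathbf{a}$ from both
sides gives

$$\mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}\mathbf{a} - r^2\mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}\mathbf{a} = r^2\mathbf{S}_{yy}\mathbf{a} - r^2\mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}\mathbf{a},$$

$$(1 - r^2)\mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy}\mathbf{a} = r^2(\mathbf{S}_{yy} - \mathbf{S}_{yx}\mathbf{S}_{xx}^{-1}\mathbf{S}_{xy})\mathbf{a}.$$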


11.8 (a) r1 = .5142, r2 = .1255

(b)
        c1        c2                d1        d2
y1      1.020     -.048     x1      .436      .823
y2      -.160     1.009     x2      -.704     -.455
                            x3      1.081     -.401

(c)
k    Λ        Approximate F    p-Value
1    .7240    2.395            .035
2    .9843    .336             .716

11.9 (a) r1 = .7885, r2 = .0537

(b)
        c1        c2                 d1        d2
y1      .5522     -1.3664    x1      .5044     -1.7686
y2      .5215     1.3784     x2      .5383     1.7586

(c)
k    Λ        Approximate F    p-Value
1    .3772    6.5972           .0003
2    .9971    .0637            .8031

11.10 (a) r1 = .4900, r2 = .3488, r3 = .0609

(b)
        c1       c2       c3               d1       d2       d3
y1      .633     .091     .806     x1      .482     -.262    1.054
y2      -.624    .816     .147     x2      -.578    1.024    -.059
y3      .643     .400     -.690    x3      .865     .216     -.626

(c)
k    Λ       Approximate F    p-Value
1    .665    2.175            .029
2    .875    1.552            .194
3    .996    .171             .681

11.11 (a) r1 = .6251, r2 = .4135

(b)
        c1        c2
y1      1.120     -.007
y2      -.498     1.003

        d1        d2
x1      1.091     -.794
x2      .184      -.288
x3      .842      1.807
x4      .944      .641
x5      1.040     -.154
x6      .215      1.256
x7      -.603     -.528
x8      -.641     -.588

(c)
k    Λ        Approximate F    p-Value
1    .4642    1.1692           .3321
2    .7553    .9718            .4766

11.12 (b) By (11.34),

$$\Lambda(x_7, x_8|x_1, x_2, \ldots, x_6) = \frac{\prod_{i=1}^{s}(1 - r_i^2)}{\prod_{i=1}^{s}(1 - c_i^2)},$$

where $r_1^2$ and $r_2^2$ are the squared canonical correlations from the full model,
and $c_1^2$ and $c_2^2$ are the squared canonical correlations from the reduced
model:

$$\Lambda(x_7, x_8|x_1, x_2, \ldots, x_6) = \frac{(1 - .6208^2)(1 - .4947^2)}{(1 - .2650^2)(1 - .0886^2)} = \frac{.4643}{.9225} = .5033$$

(c)
$$\Lambda(x_4, x_5, x_6|x_1, x_2, x_3, x_7, x_8) = \frac{(1 - .6208^2)(1 - .4947^2)}{(1 - .3301^2)(1 - .1707^2)} = \frac{.4643}{.8651} = .5367$$

(d)
$$\Lambda(x_1, x_2, x_3|x_4, x_5, \ldots, x_8) = \frac{(1 - .6208^2)(1 - .4947^2)}{(1 - .4831^2)(1 - .2185^2)} = \frac{.4643}{.7300} = .6359$$
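
The three ratios in Problem 11.12 are easy to recompute from the canonical correlations quoted there. The sketch below does so; the helper names are illustrative, and the reduced-model correlations are taken exactly as printed in the answer.

```python
from math import prod

def wilks_from_canonical(corrs):
    """Product of (1 - r_i^2) over the canonical correlations."""
    return prod(1.0 - r * r for r in corrs)

def partial_wilks(full_corrs, reduced_corrs):
    """Ratio of the full-model product to the reduced-model product."""
    return wilks_from_canonical(full_corrs) / wilks_from_canonical(reduced_corrs)

full = [0.6208, 0.4947]
lambda_b = partial_wilks(full, [0.2650, 0.0886])  # x7, x8 given x1, ..., x6
lambda_c = partial_wilks(full, [0.3301, 0.1707])  # x4, x5, x6 given the rest
lambda_d = partial_wilks(full, [0.4831, 0.2185])  # x1, x2, x3 given x4, ..., x8

print(round(lambda_b, 4), round(lambda_c, 4), round(lambda_d, 4))
```

Because the printed numerator .4643 is itself rounded, the recomputed ratios can differ from the printed ones in the fourth decimal place.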

11.13 (a) r1 = .9279, r2 = .5622, r3 = .1660,

k    Λ        Approximate F    p-Value
1    .0925    17.9776          <.0001
2    .6651    4.6366           .0020
3    .9725    1.1898           .2816

(b) r1 = .8770, r2 = .6776, r3 = .3488,

k    Λ        Approximate F    p-Value
1    .1097    6.919            <.0001
2    .4751    3.427            .001
3    .8783    1.351            .269

(c) r1 = .9095, r2 = .6395,

k    Λ        Approximate F    p-Value
1    .1022    8.2757           <.0001
2    .5911    3.1129           .0089

(d) r1 = .9029, r2 = .7797, r3 = .3597, r4 = .3233, r5 = .0794,

k    Λ        Approximate F    p-Value
1    .0561    4.992            <.0001
2    .3037    2.601            .0007
3    .7747    .829             .6210
4    .8898    .761             .6030
5    .9937    .124             .8840

CHAPTER  12 

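12.1 From $\lambda = \mathbf{a}'\mathbf{S}\mathbf{a}/\mathbf{a}'\mathbf{a}$ in (12.7), we obtain $\lambda\mathbf{a}'\mathbf{a} = \mathbf{a}'\mathbf{S}\mathbf{a}$, which can be factored
as $\mathbf{a}'(\mathbf{S}\mathbf{a} - \lambda\mathbf{a}) = 0$. Since $\mathbf{a} = \mathbf{0}$ is not a solution to $\lambda = \mathbf{a}'\mathbf{S}\mathbf{a}/\mathbf{a}'\mathbf{a}$, we have
$\mathbf{S}\mathbf{a} - \lambda\mathbf{a} = \mathbf{0}$.

12.2 $|\mathbf{R} - \lambda\mathbf{I}| = 0$,

$$\begin{vmatrix} 1 - \lambda & r \\ r & 1 - \lambda \end{vmatrix} = (1 - \lambda)^2 - r^2 = 0,$$

$$(1 - \lambda + r)(1 - \lambda - r) = 0, \qquad \lambda = 1 \pm r.$$

With $\lambda_1 = 1 + r$ in $(\mathbf{R} - \lambda_1\mathbf{I})\mathbf{a}_1 = \mathbf{0}$, we obtain

$$\begin{pmatrix} -r & r \\ r & -r \end{pmatrix}\begin{pmatrix} a_{11} \\ a_{12} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix},$$

which gives $a_{11} = a_{12}$ for any $r$. Normalizing to $\mathbf{a}_1'\mathbf{a}_1 = 1$ yields $a_{11} = 1/\sqrt{2}$.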
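
The result of Problem 12.2 can be verified for any particular r by multiplying out $\mathbf{R}\mathbf{a}$ and comparing with $\lambda\mathbf{a}$. The sketch below does this with r = .6, a value chosen only for illustration; the second eigenvector $(1/\sqrt{2}, -1/\sqrt{2})'$ for λ2 = 1 − r is not given in the answer above and is added here as a standard fact. All names are illustrative.

```python
from math import sqrt

def eig_2x2_correlation(r):
    """Eigenvalues and unit eigenvectors of R = [[1, r], [r, 1]]."""
    c = 1.0 / sqrt(2.0)
    return [(1.0 + r, (c, c)), (1.0 - r, (c, -c))]

def matvec(matrix, vector):
    """Multiply a small dense matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, vector)) for row in matrix]

r = 0.6  # illustrative correlation
R = [[1.0, r], [r, 1.0]]

# Largest deviation between R a and lambda a over both eigenpairs.
max_residual = 0.0
for eigenvalue, eigenvector in eig_2x2_correlation(r):
    product = matvec(R, eigenvector)
    for got, component in zip(product, eigenvector):
        max_residual = max(max_residual, abs(got - eigenvalue * component))

print(max_residual)
```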
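12.3 (a) By (4.14) and a comment in Section 7.1, the likelihood ratio is given by
$LR = (|\mathbf{S}|/|\mathbf{S}_0|)^{n/2}$, where $\mathbf{S}_0$ is the estimate of $\boldsymbol{\Sigma}$ under $H_0$. By (2.108)
and (7.6), the test statistic is

$$-2\ln LR = -2\ln\left(\frac{\prod_{i=1}^{p}\lambda_i}{\bar{\lambda}^k\prod_{i=1}^{p-k}\lambda_i}\right)^{n/2} = -n\ln\frac{\prod_{i=p-k+1}^{p}\lambda_i}{\bar{\lambda}^k} = n\left(k\ln\bar{\lambda} - \sum_{i=p-k+1}^{p}\ln\lambda_i\right),$$

where $\bar{\lambda}$ is the average of the last $k$ eigenvalues. In (12.15), the coefficient $n$ is modified to give an improved chi-square
approximation.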

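12.4 If $\mathbf{S}$ is diagonal, then $\lambda_i = s_{ii}$, as in (12.17). Thus

$$\mathbf{S}\mathbf{a}_i = \lambda_i\mathbf{a}_i = s_{ii}\mathbf{a}_i,$$

$$\begin{pmatrix} s_{11} & 0 & \cdots & 0 \\ 0 & s_{22} & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & s_{pp} \end{pmatrix}\begin{pmatrix} a_{i1} \\ a_{i2} \\ \vdots \\ a_{ip} \end{pmatrix} = \begin{pmatrix} s_{11}a_{i1} \\ s_{22}a_{i2} \\ \vdots \\ s_{pp}a_{ip} \end{pmatrix} = \begin{pmatrix} s_{ii}a_{i1} \\ s_{ii}a_{i2} \\ \vdots \\ s_{ii}a_{ip} \end{pmatrix}.$$

From the first element, we obtain $s_{11}a_{i1} = s_{ii}a_{i1}$ or $(s_{11} - s_{ii})a_{i1} = 0$.
Since $s_{11} - s_{ii} \neq 0$ (unless $i = 1$), we must have $a_{i1} = 0$. Thus, $\mathbf{a}_i = (0, \ldots, 0, a_{ii}, 0, \ldots, 0)'$, and normalizing $\mathbf{a}_i$ leads to $a_{ii} = 1$.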

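12.5 By (10.34) and (12.2),

$$R^2_{y_i|z_1,\ldots,z_k} = \frac{\mathbf{s}_{y_i z}'\mathbf{S}_{zz}^{-1}\mathbf{s}_{y_i z}}{s_{y_i}^2} = \frac{1}{s_{y_i}^2}(s_{y_i z_1}, s_{y_i z_2}, \ldots, s_{y_i z_k})\begin{pmatrix} s_{z_1}^2 & 0 & \cdots & 0 \\ 0 & s_{z_2}^2 & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & s_{z_k}^2 \end{pmatrix}^{-1}\begin{pmatrix} s_{y_i z_1} \\ s_{y_i z_2} \\ \vdots \\ s_{y_i z_k} \end{pmatrix}.$$

Show that this is equal to

$$R^2_{y_i|z_1,\ldots,z_k} = \sum_{j=1}^{k}\frac{s_{y_i z_j}^2}{s_{z_j}^2 s_{y_i}^2}.$$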


12.6 The variances of y1, y2, x1, x2, and x3 on the diagonal of S are .016, 70.6,
1106.4, 2381.9, and 2136.4. The eigenvalues of S and R are as follows:

              S                                     R
λi          λi/Σj λj    Cumulative      λi      λi/Σj λj    Cumulative
3466.18     .608607     .60861          1.72    .34         .34
1264.47     .222021     .83063          1.23    .25         .59
895.27      .157195     .98782          .96     .19         .78
69.34       .012174     .99999          .79     .16         .94
.01         .000002     1.00000         .30     .06         1.00

Two principal components of S account for 83% of the variance, but it
requires three principal components of R to reach 78%. For most purposes
we would use two components of S, although with three we could account for
99% of the variance. However, we show all five eigenvectors below because of
the interesting pattern they exhibit. The first principal component is largely a
weighted average of the last two variables, x2 and x3, which have the largest
variances. The second and third components represent contrasts in the last
three variables and could be described as "shape" components. The fourth and
fifth components are associated uniquely with y2 and y1, respectively. These
components are "variable specific," as described in the discussion of method 1
in Section 12.6. As expected, the principal components of R show an entirely
different pattern. All five variables contribute to the first three components of
R, whereas in S, y1 and y2 have small variances and contribute almost nothing
to the first three components. The eigenvectors of S and R are as follows:

                        S                                            R
      a1        a2        a3        a4        a5          a1      a2      a3      a4      a5
y1    .0004     -.0008    .0018     .0029     .9999       .42     .53     -.42    -.40    .46
y2    -.0080    .0166     .0286     .9994     -.0029      .07     .68     .16     .70     -.10
x1    .1547     .6382     .7535     -.0309    -.0008      .36     .20     .76     -.44    -.24
x2    .7430     .4279     -.5145    .0136     .0009       .54     -.43    .25     .39     .56
x3    .6511     -.6397    .4083     .0042     -.0015      .63     -.18    -.40    .10     -.64

/  65.1 

33.6 

47.6 

36.8 

25.4  \ 

33.6 

46.1 

28.9 

40.3 

28.4 

47.6 

28.9 

60.7 

37.4 

41.1 

36.8 

40.3 

37.4 

62.8 

31.7 

v  25.4 

28.4 

41.1 

31.7 

58.2  / 

/  1.00 

.61 

.76 

.58 

■41  \ 

.61 

1.00 

.55 

.75 

.55 

.76 

.55 

1.00 

.61 

.69 

.58 

.75 

.61 

1.00 

.52 

V  -41 

.55 

.69 

.52 

1.00  / 

R  = 
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The  eigenvalues  of  S  and  R  are  as  follows: 


               S                                  R
  λ_i    λ_i/Σ_j λ_j   Cumulative     λ_i    λ_i/Σ_j λ_j   Cumulative
 200.4      .684          .684        3.42      .683          .683
  36.1      .123          .807         .61      .123          .806
  34.1      .116          .924         .57      .114          .921
  15.0      .051          .975         .27      .054          .975
   7.4      .025         1.000         .13      .025         1.000

The first three eigenvectors of S and R are as follows:

          S                       R
   a1     a2     a3        a1     a2     a3
  .47   -.58   -.42       .44   -.20   -.68
  .39   -.11    .45       .45   -.43    .35
  .49    .10   -.48       .47    .37   -.38
  .47   -.12    .62       .45   -.39    .33
  .41    .80   -.09       .41    .70    .41

The variances in S are nearly identical, and the covariances are likewise similar
in magnitude. Consequently, the percentages of variance explained by the
eigenvalues of S and R are nearly indistinguishable. The interpretation of the
second principal component from S is slightly different from that of the second
one from R, but otherwise there is little to choose between them.

12.8  The  variances  on  the  diagonal  of  S  are  95.5,  73.2,  76.2,  808.6,  505.9,  and 
508.7.  The  eigenvalues  of  S  and  R  are  as  follows: 


               S                                  R
   λ_i    λ_i/Σ_j λ_j   Cumulative     λ_i    λ_i/Σ_j λ_j   Cumulative
 1152.0      .557          .557        2.17      .363          .363
  394.1      .191          .748        1.08      .180          .543
  310.8      .150          .898         .98      .163          .706
   97.8      .047          .945         .87      .144          .850
   68.8      .033          .978         .55      .092          .942
   44.6      .022         1.000         .35      .058         1.000

We could keep either two or three components from S. The first three
components of S account for a larger percent of variance than do the first three
from R. The first three eigenvectors of S and R are as follows:

           S                          R
   a1      a2      a3         a1      a2      a3
  .080    .092   -.069       .336    .176    .497
  .034   -.018    .202       .258    .843   -.093
  .076    .122   -.011       .370    .049    .466
  .758   -.446   -.469       .475   -.329   -.358
  .493   -.081    .844       .486    .079   -.567
  .412    .878   -.147       .471   -.376    .278

As expected, the first three principal components from S are heavily
influenced by the last three variables because of their relatively large variances.

12.9  The variances on the diagonal of S are .69; 5.4; 2,006,682.4; 90.3; 56.4; 18.1.
With the large variance of $y_3$, we would expect the first principal component
from S to account for most of the variance, and $y_3$ would essentially constitute
that single component. This is indeed the pattern that emerges in the
eigenvalues and eigenvectors of S. The principal components from R, on the other
hand, are not dominated by $y_3$. The eigenvalues of S and R are as follows:


              S                              R
     λ_i     λ_i/Σ_j λ_j       λ_i    λ_i/Σ_j λ_j   Cumulative
 2,006,760     .999954         2.42      .404          .404
        65     .000033         1.40      .234          .638
        18     .000009         1.03      .171          .809
         7     .000003          .92      .153          .963
         3     .000001          .20      .033          .996
         0     .000000          .02      .004         1.000

Most of the correlations in R are small (only three exceed .3), and its first
three principal components account for only 81% of the variance (cumulative
proportion .809). The first three eigenvectors of S and R are as follows:

              S                              R
    a1       a2       a3          a1      a2      a3
  .00016    .005    -.0136       .424   -.561   -.150
  .00051    .017     .0787       .446   -.528    .087
  .99998   -.001    -.0002       .563    .387   -.051
  .00529    .698     .0174       .454    .267    .166
  .00322   -.716     .0195       .303    .425   -.296
  .00020    .025     .9965       .073    .069    .923
12.10  Covariance matrix for males:

\[
\mathbf{S}_M = \begin{pmatrix}
5.19 & 4.55 & 6.52 & 5.25 \\
4.55 & 13.18 & 6.76 & 6.27 \\
6.52 & 6.76 & 28.67 & 14.47 \\
5.25 & 6.27 & 14.47 & 16.65
\end{pmatrix}
\]

Covariance  matrix  for  females: 

\[
\mathbf{S}_F = \begin{pmatrix}
9.14 & 7.55 & 4.86 & 4.15 \\
7.55 & 18.60 & 10.22 & 5.45 \\
4.86 & 10.22 & 30.04 & 13.49 \\
4.15 & 5.45 & 13.49 & 28.00
\end{pmatrix}
\]

The  eigenvalues  are  as  follows: 


             Males                              Females
  λ_i    λ_i/Σ_j λ_j   Cumulative     λ_i    λ_i/Σ_j λ_j   Cumulative
 43.56      .684          .684       48.96      .571          .571
 11.14      .175          .858       18.46      .215          .786
  6.47      .102          .960       13.54      .158          .944
  2.52      .040         1.000        4.82      .056         1.000

The  first  two  eigenvectors  are  as  follows: 


     Males            Females
   a1     a2        a1     a2
  .24    .21       .22    .27
  .31    .85       .39    .62
  .76   -.48       .68    .17
  .52    .09       .58   -.72

The variances in $\mathbf{S}_M$ have a slightly wider range (5.19-28.67) than those in
$\mathbf{S}_F$ (9.14-30.04), and this is reflected in the eigenvalues. The first two
components account for 86% of the variance from $\mathbf{S}_M$, whereas the first two
account for 79% from $\mathbf{S}_F$.
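
The leading eigenvalue and eigenvector reported for the males can be checked without any linear algebra library. The following is a minimal sketch using power iteration on the male covariance matrix as printed above; the helper names (`mat_vec`, `largest_eigenpair`) are illustrative and not from the text, and power iteration is only one of several ways to obtain the dominant eigenpair.

```python
import math

# Male covariance matrix S_M as printed in the answer to 12.10.
S_M = [
    [5.19, 4.55, 6.52, 5.25],
    [4.55, 13.18, 6.76, 6.27],
    [6.52, 6.76, 28.67, 14.47],
    [5.25, 6.27, 14.47, 16.65],
]


def mat_vec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def largest_eigenpair(m, iterations=200):
    """Power iteration: return (eigenvalue, unit eigenvector) for the dominant eigenvalue."""
    v = [1.0] * len(m)
    for _ in range(iterations):
        w = mat_vec(m, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v'Mv for the converged unit vector.
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(m, v)))
    return eigenvalue, v


lambda_1, a_1 = largest_eigenpair(S_M)
total_variance = sum(S_M[i][i] for i in range(len(S_M)))  # trace = sum of eigenvalues
proportion_1 = lambda_1 / total_variance

print(round(lambda_1, 2), round(proportion_1, 3), [round(x, 2) for x in a_1])
```

The trace of the matrix supplies the denominator for the proportion of variance, so only the dominant eigenvalue has to be computed to check the first row of the eigenvalue table.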

12.11  Covariance  matrix  for  species  1: 


\[
\mathbf{S}_1 = \begin{pmatrix}
187.6 & 176.9 & 48.4 & 113.6 \\
176.9 & 345.4 & 76.0 & 118.8 \\
48.4 & 76.0 & 66.4 & 16.2 \\
113.6 & 118.8 & 16.2 & 239.9
\end{pmatrix}
\]
Covariance matrix for species 2:

\[
\mathbf{S}_2 = \begin{pmatrix}
101.8 & 128.1 & 37.0 & 32.6 \\
128.1 & 389.0 & 165.4 & 94.4 \\
37.0 & 165.4 & 167.5 & 66.5 \\
32.6 & 94.4 & 66.5 & 177.9
\end{pmatrix}
\]

The  eigenvalues  are  as  follows: 


            Species 1                          Species 2
  λ_i    λ_i/Σ_j λ_j   Cumulative     λ_i    λ_i/Σ_j λ_j   Cumulative
 561.3      .669          .669       555.7      .664          .664
 169.0      .201          .870       145.4      .174          .838
  65.3      .078          .948        93.5      .112          .950
  43.7      .052         1.000        41.7      .050         1.000

The  first  two  eigenvectors  are  as  follows: 


   Species 1         Species 2
   a1     a2        a1     a2
  .50    .01       .28   -.20
  .72   -.48       .81   -.34
  .17   -.22       .42    .14
  .45    .85       .30    .91

The variances in $\mathbf{S}_1$ have a wider range than those in $\mathbf{S}_2$, and the first two
components of $\mathbf{S}_1$ account for a higher percent of variance.

12.12  The  variances  on  the  diagonal  of  S  in  each  case  are: 

(a)  Pooled:  536.0,  59.9,  116.0,  896.4,  248.1,  862.0, 

(b)  Unpooled:  528.2,  68.9,  145.2,  1366.4,  264.4,  1069.1. 

The  eigenvalues  are  as  follows: 


              Pooled                            Unpooled
   λ_i    λ_i/Σ_j λ_j   Cumulative     λ_i    λ_i/Σ_j λ_j   Cumulative
 1050.6      .386          .386      1722.0      .500          .500
  858.3      .316          .702       878.4      .255          .755
  398.9      .147          .849       401.4      .117          .872
  259.2      .095          .944       261.1      .076          .948
  108.1      .040          .984       128.9      .037          .985
   43.4      .016         1.000        50.4      .015         1.000
The first three eigenvectors are as follows:

         Pooled                    Unpooled
   a1      a2      a3         a1      a2      a3
  .441   -.190    .864       .212    .389    .888
  .041   -.038    .082      -.039    .064    .096
 -.039    .031    .143       .080   -.066    .081
  .450    .892   -.033       .776   -.608    .081
 -.019   -.001   -.054      -.096    .010    .015
  .774   -.407   -.471       .580    .686   -.434

(c)  The  pattern  in  eigenvalues  as  well  as  eigenvectors  is  similar  for  the  pooled 
and  unpooled  cases.  The  first  three  principal  components  account  for 
87.2%  of  the  variance  in  the  unpooled  case  compared  to  84.9%  for  the 
pooled  case. 

12.13  The  variances  on  the  diagonal  of  S  in  each  case  are: 

(a)  Pooled:  49.1,  8.1,  12140.8,  136.2,  210.8,  2983.9, 

(b)  Unpooled:  63.2,  8.0,  15168.9,  186.6,  255.4,  4660.7. 

The  eigenvalues  are  as  follows: 


               Pooled                             Unpooled
    λ_i     λ_i/Σ_j λ_j   Cumulative      λ_i     λ_i/Σ_j λ_j   Cumulative
 12,809.0     .8249         .8249      17,087.0     .8400         .8400
  2,455.9     .1582         .9830       2,958.0     .1454         .9854
    137.1     .0088         .9918         168.6     .0083         .9937
     77.2     .0050         .9968          77.1     .0038         .9974
     42.2     .0027         .9995          44.7     .0022         .9996
      7.4     .0005        1.0000           7.3     .0004        1.0000

The  eigenvectors  are  as  follows: 

     Pooled            Unpooled
   a1      a2        a1      a2
 -.004   -.000      .013    .027
 -.005    .004     -.004    .004
  .968   -.233      .931   -.355
 -.002    .023      .028    .069
  .103    .041      .103    .021
  .228    .971      .350    .932

12.14  The variances on the diagonal of S are all less than 1, except for two:
5.02 and 1541.08, the latter being the variance of the last variable. We
therefore expect the last variable, $x_9$, to dominate the principal components
of S. This is the case for S but not for R. The eigenvalues of S and R are as
follows:
            S                              R
   λ_i     λ_i/Σ_j λ_j       λ_i    λ_i/Σ_j λ_j   Cumulative
 1541.55     .996273        3.174      .317          .317
    4.83     .003123        2.565      .256          .574
     .44     .000286        1.432      .143          .717
     .27     .000174        1.277      .128          .845
     .10     .000066         .542      .054          .899
     .07     .000043         .473      .047          .946
     .02     .000014         .251      .025          .971
     .02     .000011         .118      .012          .983
     .01     .000005         .104      .010          .994
     .00     .000003         .064      .006         1.000

The  eigenvectors  of  S  and  R  are  as  follows: 

        S                         R
    a1       a2        a1     a2     a3     a4
   .0009   -.005      .12    .19    .69    .10
   .0007   -.034      .06    .32    .54    .26
   .0029   -.007      .46   -.06    .07   -.38
   .0014    .004      .29    .17   -.18    .49
   .0059   -.009      .52    .14   -.04   -.01
  -.0150    .982     -.09   -.42    .07    .55
  -.0028   -.092     -.31    .45   -.01   -.14
  -.0022   -.158     -.23    .54   -.14   -.10
   .0044   -.011      .09    .36   -.38    .44
   .9998    .014      .50    .11   -.13   -.09

The variances on the diagonal of S are 55.7, 10.9, 402.7, 25.7, 13.4, 438.3, 1.5,
106.2, 885.6, 22227.2, and 214.1.

The  eigenvalues  of  S  and  R  are  as  follows: 

                S                                    R
    λ_i     λ_i/Σ_j λ_j   Cumulative      λ_i    λ_i/Σ_j λ_j   Cumulative
 22,303.5     .91479        .91479       6.020     .54730        .54730
   1590.7     .06524        .98003       2.119     .19267        .73996
    358.0     .01469        .99471       1.130     .10275        .84272
     63.4     .00260        .99731        .760     .06909        .91181
     29.3     .00120        .99852        .355     .03231        .94411
     17.1     .00070        .99922        .259     .02358        .96769
     12.7     .00052        .99974        .122     .01110        .97879
      2.8     .00012        .99986        .110     .01004        .98883
      1.9     .00008        .99994        .060     .00544        .99427
       .9     .00004        .99997        .042     .00384        .99810
       .7     .00003       1.00000        .021     .00190       1.00000
The eigenvectors of S and R are as follows:

             S                            R
         a1       a2         a1       a2       a3       a4
y1     -.0097    .1331      .3304   -.0787    .0880   -.2807
y2      .0006    .0608      .3542    .1928    .1071   -.2301
y3     -.0141    .4397      .3923    .0518    .1105   -.1413
y4     -.0033    .1078      .3820    .0474    .1334   -.0104
y5      .0101    .0398      .2323    .5303    .0154   -.0710
y6      .0167    .4290      .3621    .2361    .1198    .1350
y7     -.0012   -.0072     -.0884    .0213    .7946    .5414
y8      .0275   -.1844     -.2501    .5023    .0826   -.1506
y9      .0456   -.6657     -.3111    .3595    .2136   -.2278
y10     .9982    .0346     -.0243    .4685   -.4669    .5001
y11     .0034    .3311      .3357   -.1153   -.1853    .4550

For most purposes, one or two principal components would suffice for S, with
91% or 98% of the variance explained. For R, on the other hand, three
components are required to explain 84% of the variance, and seven components are
necessary to reach 98%. The reduction to one or two components for S is due
in part to the relatively large variances of $y_3$, $y_6$, $y_9$, and $y_{10}$. In the
eigenvectors of S, we see that these four variables figure prominently in the first
two principal components.
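
The percentages quoted in this comparison follow directly from the eigenvalue tables. As a small sketch (the function name `cumulative_proportions` is illustrative, and the eigenvalues are typed in as printed, so results agree with the tables only up to rounding):

```python
from itertools import accumulate

# Eigenvalues of S and R as printed in the tables above.
eig_S = [22303.5, 1590.7, 358.0, 63.4, 29.3, 17.1, 12.7, 2.8, 1.9, 0.9, 0.7]
eig_R = [6.020, 2.119, 1.130, 0.760, 0.355, 0.259, 0.122, 0.110, 0.060, 0.042, 0.021]


def cumulative_proportions(eigenvalues):
    """Cumulative share of total variance explained by the leading components."""
    total = sum(eigenvalues)
    return [running / total for running in accumulate(eigenvalues)]


cum_S = cumulative_proportions(eig_S)
cum_R = cumulative_proportions(eig_R)

print([round(c, 3) for c in cum_S[:2]])  # one and two components of S
print([round(c, 3) for c in cum_R[:7]])  # up to seven components of R
```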


CHAPTER  13 

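13.1
\[
\begin{aligned}
\operatorname{var}(y_i) &= \operatorname{var}(y_i - \mu_i)
  = \operatorname{var}(\lambda_{i1} f_1 + \lambda_{i2} f_2 + \cdots + \lambda_{im} f_m + \varepsilon_i) \\
&= \sum_{j=1}^{m} \lambda_{ij}^2 \operatorname{var}(f_j) + \operatorname{var}(\varepsilon_i)
  + \sum_{j \ne k} \lambda_{ij}\lambda_{ik} \operatorname{cov}(f_j, f_k)
  + 2\sum_{j=1}^{m} \lambda_{ij} \operatorname{cov}(f_j, \varepsilon_i) \\
&= \sum_{j=1}^{m} \lambda_{ij}^2 + \psi_i .
\end{aligned}
\]

The last equality follows by the assumptions $\operatorname{var}(f_j) = 1$, $\operatorname{var}(\varepsilon_i) = \psi_i$,
$\operatorname{cov}(f_j, f_k) = 0$ for $j \ne k$, and $\operatorname{cov}(f_j, \varepsilon_i) = 0$.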

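13.2
\[
\begin{aligned}
\operatorname{cov}(\mathbf{y}, \mathbf{f}) &= \operatorname{cov}(\boldsymbol{\Lambda}\mathbf{f} + \boldsymbol{\varepsilon}, \mathbf{f}) && \text{[by (13.3)]} \\
&= \operatorname{cov}(\boldsymbol{\Lambda}\mathbf{f}, \mathbf{f}) && \text{[by (13.10)]} \\
&= E[\boldsymbol{\Lambda}\mathbf{f} - E(\boldsymbol{\Lambda}\mathbf{f})][\mathbf{f} - E(\mathbf{f})]' && \text{[by analogy to (3.31)]} \\
&= E[\boldsymbol{\Lambda}\mathbf{f} - \boldsymbol{\Lambda}E(\mathbf{f})][\mathbf{f} - E(\mathbf{f})]' \\
&= \boldsymbol{\Lambda}E[\mathbf{f} - E(\mathbf{f})][\mathbf{f} - E(\mathbf{f})]' \\
&= \boldsymbol{\Lambda}\operatorname{cov}(\mathbf{f}) = \boldsymbol{\Lambda} && \text{[by (13.7)]}
\end{aligned}
\]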
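13.3  $E(\mathbf{f}^*) = E(\mathbf{T}'\mathbf{f}) = \mathbf{T}'E(\mathbf{f}) = \mathbf{T}'\mathbf{0} = \mathbf{0}$,
$\operatorname{cov}(\mathbf{f}^*) = \operatorname{cov}(\mathbf{T}'\mathbf{f}) = \mathbf{T}'\operatorname{cov}(\mathbf{f})\mathbf{T} = \mathbf{T}'\mathbf{I}\mathbf{T} = \mathbf{I}$.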

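13.4  Let $\mathbf{E} = \mathbf{S} - (\hat{\boldsymbol{\Lambda}}\hat{\boldsymbol{\Lambda}}' + \hat{\boldsymbol{\Psi}})$. Then by (2.98), $\operatorname{tr}(\mathbf{E}'\mathbf{E}) = \sum_{ij} e_{ij}^2$. By (13.26),
$\hat{\boldsymbol{\Psi}} = \operatorname{diag}(\mathbf{S} - \hat{\boldsymbol{\Lambda}}\hat{\boldsymbol{\Lambda}}')$, and $\mathbf{E}$ has zeros on the diagonal. This gives the inequality

\[
\sum_{ij} e_{ij}^2 \le \text{sum of squared elements of } \mathbf{S} - \hat{\boldsymbol{\Lambda}}\hat{\boldsymbol{\Lambda}}' .
\]

By (2.98),

\[
\text{sum of squared elements of } \mathbf{S} - \hat{\boldsymbol{\Lambda}}\hat{\boldsymbol{\Lambda}}'
= \operatorname{tr}(\mathbf{S} - \hat{\boldsymbol{\Lambda}}\hat{\boldsymbol{\Lambda}}')'(\mathbf{S} - \hat{\boldsymbol{\Lambda}}\hat{\boldsymbol{\Lambda}}') .
\]

Since $\mathbf{S} - \hat{\boldsymbol{\Lambda}}\hat{\boldsymbol{\Lambda}}'$ is symmetric, we have by (13.20), (13.23), and (13.24),

\[
\mathbf{S} - \hat{\boldsymbol{\Lambda}}\hat{\boldsymbol{\Lambda}}'
= \mathbf{C}\mathbf{D}\mathbf{C}' - \mathbf{C}_1\mathbf{D}_1^{1/2}\mathbf{D}_1^{1/2}\mathbf{C}_1'
= \mathbf{C}\mathbf{D}\mathbf{C}' - \mathbf{C}_1\mathbf{D}_1\mathbf{C}_1' ,
\]

where $\mathbf{C} = (\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_p)$ contains normalized eigenvectors of $\mathbf{S}$,
$\mathbf{D} = \operatorname{diag}(\theta_1, \theta_2, \ldots, \theta_p)$ contains eigenvalues of $\mathbf{S}$, $\mathbf{C}_1 = (\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_m)$, and
$\mathbf{D}_1 = \operatorname{diag}(\theta_1, \theta_2, \ldots, \theta_m)$. Using the partitioned forms

\[
\mathbf{C} = (\mathbf{C}_1, \mathbf{C}_2) \quad\text{and}\quad
\mathbf{D} = \begin{pmatrix} \mathbf{D}_1 & \mathbf{O} \\ \mathbf{O} & \mathbf{D}_2 \end{pmatrix},
\]

show that

\[
\mathbf{C}_1'\mathbf{C}_1 = \mathbf{I}_m, \quad
\mathbf{C}_1'\mathbf{C}_2 = \mathbf{O}, \quad
\mathbf{C}'\mathbf{C}_1 = \begin{pmatrix} \mathbf{I}_m \\ \mathbf{O} \end{pmatrix}, \quad
\mathbf{D}\begin{pmatrix} \mathbf{I}_m \\ \mathbf{O} \end{pmatrix} = \begin{pmatrix} \mathbf{D}_1 \\ \mathbf{O} \end{pmatrix}, \quad
\mathbf{C}\begin{pmatrix} \mathbf{D}_1 \\ \mathbf{O} \end{pmatrix} = \mathbf{C}_1\mathbf{D}_1 ,
\]

and $\mathbf{C}\mathbf{D}\mathbf{C}'\mathbf{C}_1\mathbf{D}_1\mathbf{C}_1' = \mathbf{C}_1\mathbf{D}_1^2\mathbf{C}_1'$. Show similarly that
$\mathbf{C}_1\mathbf{D}_1\mathbf{C}_1'\mathbf{C}\mathbf{D}\mathbf{C}' = \mathbf{C}_1\mathbf{D}_1^2\mathbf{C}_1'$ and $\mathbf{C}_1\mathbf{D}_1\mathbf{C}_1'\mathbf{C}_1\mathbf{D}_1\mathbf{C}_1' = \mathbf{C}_1\mathbf{D}_1^2\mathbf{C}_1'$. Now by (2.97),
$\operatorname{tr}(\mathbf{C}\mathbf{D}^2\mathbf{C}') = \operatorname{tr}(\mathbf{C}'\mathbf{C}\mathbf{D}^2) = \operatorname{tr}(\mathbf{D}^2) = \sum_{i=1}^{p} \theta_i^2$. Similarly, $\operatorname{tr}(\mathbf{C}_1\mathbf{D}_1^2\mathbf{C}_1') = \sum_{i=1}^{m} \theta_i^2$. Then

\[
\begin{aligned}
\operatorname{tr}(\mathbf{S} - \hat{\boldsymbol{\Lambda}}\hat{\boldsymbol{\Lambda}}')'(\mathbf{S} - \hat{\boldsymbol{\Lambda}}\hat{\boldsymbol{\Lambda}}')
&= \operatorname{tr}(\mathbf{C}\mathbf{D}\mathbf{C}' - \mathbf{C}_1\mathbf{D}_1\mathbf{C}_1')(\mathbf{C}\mathbf{D}\mathbf{C}' - \mathbf{C}_1\mathbf{D}_1\mathbf{C}_1') \\
&= \operatorname{tr}(\mathbf{C}\mathbf{D}\mathbf{C}'\mathbf{C}\mathbf{D}\mathbf{C}' - \mathbf{C}\mathbf{D}\mathbf{C}'\mathbf{C}_1\mathbf{D}_1\mathbf{C}_1' - \mathbf{C}_1\mathbf{D}_1\mathbf{C}_1'\mathbf{C}\mathbf{D}\mathbf{C}' + \mathbf{C}_1\mathbf{D}_1\mathbf{C}_1'\mathbf{C}_1\mathbf{D}_1\mathbf{C}_1') \\
&= \sum_{i=1}^{p} \theta_i^2 - \sum_{i=1}^{m} \theta_i^2 - \sum_{i=1}^{m} \theta_i^2 + \sum_{i=1}^{m} \theta_i^2 \\
&= \sum_{i=m+1}^{p} \theta_i^2 .
\end{aligned}
\]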
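13.5
\[
\sum_{i=1}^{p} \sum_{j=1}^{m} \hat{\lambda}_{ij}^2
= \sum_{i=1}^{p} \Bigl[ \sum_{j=1}^{m} \hat{\lambda}_{ij}^2 \Bigr]
= \sum_{i=1}^{p} \hat{h}_i^2 \qquad \text{[by (13.28)]}.
\]

By interchanging the order of summation, we have

\[
\sum_{i=1}^{p} \sum_{j=1}^{m} \hat{\lambda}_{ij}^2
= \sum_{j=1}^{m} \sum_{i=1}^{p} \hat{\lambda}_{ij}^2
= \sum_{j=1}^{m} \theta_j \qquad \text{[by (13.29)]}.
\]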

13.6  We use the covariance matrix to avoid working with standardized variables.
The eigenvalues of S are 39.16, 8.78, .66, .30, and 0. The eigenvector
corresponding to $\lambda_5 = 0$ is

\[
\mathbf{a}_5' = (-.75, -.25, .25, .50, .25).
\]

As noted in Section 12.7, $s_{z_5}^2 = 0$ implies $z_5 = 0$. Thus

\[
\begin{aligned}
z_5 = \mathbf{a}_5'\mathbf{y} &= -.75y_1 - .25y_2 + .25y_3 + .50y_4 + .25y_5 = 0, \\
3y_1 + y_2 &= y_3 + 2y_4 + y_5 .
\end{aligned}
\]

13.7  Words  data  of  Table  5.9: 


                    Principal        Varimax
                    Component        Rotated
                    Loadings         Loadings        Communalities,
Variables           f1      f2       f1      f2      h_i^2

Informal words     .802   -.535     .956    .129     .930
Informal verbs     .856   -.326     .858    .321     .839
Formal words       .883    .270     .484    .786     .853
Formal verbs       .714    .658     .101    .966     .943

Variance          2.666    .899    1.894   1.671    3.565
Proportion         .666    .225     .474    .418     .891

The orthogonal matrix T for the varimax rotation as given by (13.49) is

\[
\mathbf{T} = \begin{pmatrix} .750 & .661 \\ -.661 & .750 \end{pmatrix}.
\]

Thus $\sin\phi = -.661$ and $\phi = -41.4°$. A graphical rotation of $-40°$ would
produce results very close to the varimax rotation.
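
The relation between the unrotated loadings, the matrix T, and the rotated loadings can be checked by direct multiplication. The sketch below assumes that the second-component loadings of the two "informal" variables are negative (-.535 and -.326); that sign pattern is the one for which the product of the loadings with T reproduces the printed varimax loadings. The function name `rotate` is illustrative.

```python
import math

# Unrotated principal component loadings for the four word variables.
# Assumption: the second-factor loadings of the two "informal" variables are negative.
loadings = [
    [0.802, -0.535],  # informal words
    [0.856, -0.326],  # informal verbs
    [0.883, 0.270],   # formal words
    [0.714, 0.658],   # formal verbs
]

# Orthogonal rotation matrix T as printed (rounded to three decimals).
T = [[0.750, 0.661],
     [-0.661, 0.750]]


def rotate(lam, t):
    """Return the matrix product lam * t for a p x 2 loading matrix and a 2 x 2 rotation."""
    return [[sum(row[k] * t[k][j] for k in range(2)) for j in range(2)] for row in lam]


rotated = rotate(loadings, T)

# Communalities are row sums of squared loadings; an orthogonal rotation leaves them unchanged.
communalities = [sum(x * x for x in row) for row in loadings]
rotated_communalities = [sum(x * x for x in row) for row in rotated]

# Rotation angle recovered from sin(phi) = T[1][0].
angle_degrees = math.degrees(math.asin(T[1][0]))

for row in rotated:
    print([round(x, 3) for x in row])
print([round(h, 3) for h in communalities], round(angle_degrees, 1))
```

Because T is printed to only three decimals, the product agrees with the tabulated rotated loadings to about the third decimal place rather than exactly.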
13.8  Ramus bone data of Table 3.6:

                Principal        Varimax                          Orthoblique
                Component        Rotated                          Pattern
                Loadings         Loadings        Communalities,   Loadings
Variables       f1      f2       f1      f2      h_i^2            f1      f2

8 years        .949   -.295     .884    .455     .988           -.108   1.087
8½ years       .974   -.193     .830    .545     .986            .106    .900
9 years        .978    .171     .578    .808     .986            .825    .188
9½ years       .943    .319     .449    .888     .991           1.099   -.121

Variance      3.695    .255    2.005   1.946    3.951
Proportion     .924    .064     .501    .486     .988

The Harris-Kaiser orthoblique rotation produced loadings for which the
variables have a complexity of 1. These oblique loadings provide a much
cleaner simple structure than that given by the varimax loadings. For
interpretation, we see that one factor represents variables 1 and 2, and the other
factor represents variables 3 and 4. This same clustering of variables can be
deduced from the varimax loadings if we simply use the larger of the two loadings
for each variable.

The correlation between the two oblique factors is .87. The angle between
the oblique axes is $\cos^{-1}(.87) = 29.5°$. With such a small angle between the
axes and a large correlation between the factors, it is clear that a single factor
would better represent the variables. This is also borne out by the eigenvalues
of the correlation matrix: 3.695, .255, .033, and .017. The first accounts for
92% of the variance and the second for only 6%.

13.9  Rootstock  data  of  Table  6.2: 


                       Principal        Varimax
                       Component        Rotated
                       Loadings         Loadings        Communalities,
Variables              f1      f2       f1      f2      h_i^2

Trunk 4 years         .787    .575     .167    .960     .949
Extension 4 years     .849    .467     .287    .925     .939
Trunk 15 years        .875   -.455     .946    .280     .973
Weight 15 years       .824   -.547     .973    .179     .978

Variance             2.785   1.054    1.951   1.888    3.839
Proportion            .696    .264     .488    .472     .960

The rotation was successful in producing variables with a complexity of 1,
that is, partitioning the variables into two groups, each with two variables.

13.10  (a)  Fish data of Table 6.17:


               Principal        Varimax
               Component        Rotated
               Loadings         Loadings        Communalities,
Variables      f1      f2       f1      f2      h_i^2

y1            .830   -.403     .874    .294     .851
y2            .783   -.504     .911    .189     .866
y3            .803    .432     .270    .871     .831
y4            .769    .497     .200    .893     .838

Variance     2.537    .850    1.709   1.678    3.386
Proportion    .634    .213     .427    .420     .847

(b)  The loadings for $y_1$ and $y_2$ are similar. In R we see some indication of the
reason for this; $y_1$ and $y_2$ are more highly correlated than any other pair of
variables, and their correlations with $y_3$ and $y_4$ are similar:

\[
\mathbf{R} = \begin{pmatrix}
1.00 & .71 & .51 & .40 \\
.71 & 1.00 & .38 & .40 \\
.51 & .38 & 1.00 & .67 \\
.40 & .40 & .67 & 1.00
\end{pmatrix}
\]

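(c)  By (13.58), the factor score coefficient matrix is

\[
\mathbf{R}^{-1}\hat{\boldsymbol{\Lambda}} = \begin{pmatrix}
.566 & -.109 \\
.636 & -.207 \\
-.130 & .584 \\
-.194 & .630
\end{pmatrix},
\]

where $\hat{\boldsymbol{\Lambda}}$ is the matrix of rotated factor loadings given in part (a). The
factor scores are given by (13.59) as follows: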
     Method 1             Method 2             Method 3
    f1       f2          f1       f2          f1       f2
   .544    1.151       -.254     .309      -1.156    2.104
  1.250    -.254       -.309   -1.534       -.321     .878
  1.017    1.120      -1.865   -1.558       -.671     .947
  -.147   -1.583       -.999    -.690        .067    1.130
   .219    -.103        .520    -.343      -1.610    -.458
  1.007     .679        .919    -.111        .557     .491
  1.413    -.186       -.443    -.018       -.454    1.157
  -.666   -2.279       -.265     .676       -.961     .063
  1.057   -1.870       1.449    -.295       -.230    1.721
   .388    -.440       1.371     .295      -1.309     .054
  1.328    -.298       1.260  [illegible]  -1.766    -.111
   .694    -.033       -.000   -1.452      -1.636    -.048

(d)  A one-way MANOVA on the two factor scores comparing the three
methods yielded the following values for E and H:

\[
\mathbf{E} = \begin{pmatrix} 21.8606 & 10.3073 \\ 10.3073 & 25.2081 \end{pmatrix},
\qquad
\mathbf{H} = \begin{pmatrix} 13.1394 & -10.3073 \\ -10.3073 & 9.7919 \end{pmatrix}.
\]

The four MANOVA test statistics are $\Lambda = .3631$, $V^{(s)} = .6552$, $U^{(s)} = 1.7035$,
and $\theta = .6259$. All are highly significant.
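
The four statistics can be recomputed from the printed E and H through the eigenvalues of $\mathbf{E}^{-1}\mathbf{H}$. The sketch below uses the standard definitions (Wilks' Λ, Pillai's trace, the Lawley-Hotelling trace, and Roy's largest root expressed as θ = λ₁/(1 + λ₁)); the helper names are illustrative, and the closed-form eigenvalue formula applies only because the matrices are 2 × 2.

```python
import math

# E and H matrices as printed in part (d).
E = [[21.8606, 10.3073], [10.3073, 25.2081]]
H = [[13.1394, -10.3073], [-10.3073, 9.7919]]


def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def inv2(m):
    d = det2(m)
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]


def matmul2(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


# Eigenvalues of E^{-1}H from the trace and determinant of the 2 x 2 product.
M = matmul2(inv2(E), H)
trace, det = M[0][0] + M[1][1], det2(M)
root = math.sqrt(trace * trace - 4.0 * det)
lam1, lam2 = (trace + root) / 2.0, (trace - root) / 2.0

wilks = 1.0 / ((1.0 + lam1) * (1.0 + lam2))           # Lambda
pillai = lam1 / (1.0 + lam1) + lam2 / (1.0 + lam2)    # V^(s)
lawley_hotelling = lam1 + lam2                        # U^(s)
roy = lam1 / (1.0 + lam1)                             # theta

print(round(wilks, 4), round(pillai, 4), round(lawley_hotelling, 4), round(roy, 4))
```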

13.11  (a)  For  the  flea  data  of  Table  5.5,  the  eigenvalues  of  R  are  2.273,  1.081,  .450, 
and  .196.  There  is  a  noticeable  gap  between  1.081  and  .450,  and  the  first 
two  factors  account  for  83.9%  of  the  variance.  Thus  m  =  2  factors  seem 
to  be  indicated  for  this  set  of  data. 

(b)

               Principal        Varimax                          Orthoblique
               Component        Rotated                          Pattern
               Loadings         Loadings        Communalities,   Loadings
Variables      f1      f2       f1      f2      h_i^2            f1      f2

y1           -.038    .989    -.025    .990     .980           -.003    .990
y2            .889    .269     .892    .256     .862            .898    .253
y3            .893   -.157     .891   -.170     .823            .887   -.173
y4            .827   -.073     .823   -.084     .689            .824    .087

Variance     2.273   1.081    2.273   1.081    3.354
Proportion    .568    .270     .568    .270     .839

(The  variance  explained  by  the  varimax  rotated  factors  remains  the 
same  as  for  the  initial  factors  when  rounded  to  three  decimal  places.) 
(c)  In this case, neither of the rotations changes the initial loadings
appreciably. The reason for this unusual outcome can be seen in the correlation
matrix:

\[
\mathbf{R} = \begin{pmatrix}
1.00 & .18 & -.17 & -.07 \\
.18 & 1.00 & .73 & .59 \\
-.17 & .73 & 1.00 & .59 \\
-.07 & .59 & .59 & 1.00
\end{pmatrix}
\]

There are clearly two clusters of variables: $\{y_1\}$ and $\{y_2, y_3, y_4\}$. We would
expect two factors corresponding to these groupings to emerge after
rotation. That the same pattern surfaces in the initial factor loadings (based
on eigenvectors) is due to their affiliation with principal components. As
noted in Section 12.8.1, if a variable has small correlations with all other
variables, the variable itself will essentially constitute a principal
component. In this case, $y_1$ has this property and makes up most of the second
principal component. The first component is composed of the other three
variables.

13.12  (a)  For the engineer data of Table 5.6, the number of eigenvalues greater than
1 is three, but the three account for only 70% of the variance. It requires
four eigenvalues to reach 84%. The scree plot also indicates four
eigenvalues.

(b)

                   Principal                   Varimax
               Component Loadings          Rotated Loadings       Communalities,
Variables      f1      f2      f3          f1      f2      f3     h_i^2

y1            .536    .461    .478       -.063    .834    .170     .729
y2           -.129    .870   -.182       -.357    .100    .818     .806
y3            .514   -.254   -.448        .724   -.026    .068     .529
y4            .724   -.366   -.110        .739    .295   -.193     .670
y5           -.416   -.414    .649       -.484   -.013   -.729     .766
y6            .715    .124    .420        .239    .800   -.069     .702

Variance     1.775   1.354   1.073       1.493   1.435   1.275    4.202
Proportion    .296    .226    .179        .249    .239    .212     .700

(c)  The initial communality estimates for the six variables are given by
(13.36) as .215, .225, .113, .255, .161, .248. With these substituted for the
diagonal of R, the eigenvalues of $\mathbf{R} - \hat{\boldsymbol{\Psi}}$ are

Eigenvalue     .994    .569    .255   -.025   -.237   -.339
Proportion     .816    .468    .209   -.020   -.195   -.278
Cumulative     .816   1.284   1.493   1.473   1.278   1.000

The  principal  factor  loadings  and  varimax  rotation  are  as  follows: 


                   Principal                   Varimax
               Component Loadings          Rotated Loadings       Communalities,
Variables      f1      f2      f3          f1      f2      f3     h_i^2

y1            .403    .312    .227        .030    .536    .151     .311
y2           -.106    .569   -.100       -.288    .083    .505     .345
y3            .343   -.139   -.197        .413    .060    .037     .176
y4            .559   -.247   -.090        .564    .233   -.094     .381
y5           -.286   -.246    .328       -.262   -.088   -.417     .250
y6            .556    .089    .197        .258    .537    .003     .356

(d)  The  pattern  of  loadings  is  similar  in  parts  (b)  and  (c),  and  the  interpretation 
of  the  three  factors  would  be  the  same. 

13.13  Probe  word  data  of  Table  3.5: 


               Principal        Varimax                          Orthoblique
               Component        Rotated                          Pattern
               Loadings         Loadings        Communalities,   Loadings
Variables      f1      f2       f1      f2      h_i^2            f1      f2

y1            .817   -.157     .732    .395     .692            .737    .131
y2            .838   -.336     .861    .271     .815            .963   -.092
y3            .874    .288     .494    .776     .847            .248    .734
y4            .838   -.308     .844    .292     .798            .931   -.057
y5            .762    .547     .244    .905     .879           -.134   1.023

Variance     3.416    .614    2.294   1.736    4.031
Proportion    .683    .123     .459    .347     .806

The loadings for $y_2$ are similar to those for $y_4$ in all three sets of loadings.
The reason for this can be seen in the correlation matrix

\[
\begin{pmatrix}
1.00 & .61 & .76 & .58 & .41 \\
.61 & 1.00 & .55 & .75 & .55 \\
.76 & .55 & 1.00 & .61 & .69 \\
.58 & .75 & .61 & 1.00 & .52 \\
.41 & .55 & .69 & .52 & 1.00
\end{pmatrix}
\]

The correlations of $y_2$ with $y_1$, $y_3$, and $y_5$ are very similar to the correlations
of $y_4$ with $y_1$, $y_3$, and $y_5$.
CHAPTER 14


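14.1  Adding and subtracting $\bar{x}$ and $\bar{y}$ in (14.2) (squared), we obtain

\[
\begin{aligned}
d^2(\mathbf{x}, \mathbf{y}) &= \sum_{j=1}^{p} [(x_j - \bar{x}) - (y_j - \bar{y}) + (\bar{x} - \bar{y})]^2 \\
&= \sum_{j=1}^{p} (x_j - \bar{x})^2 + \sum_{j=1}^{p} (y_j - \bar{y})^2 + p(\bar{x} - \bar{y})^2
 - 2\sum_{j=1}^{p} (x_j - \bar{x})(y_j - \bar{y}) .
\end{aligned}
\]

The other two terms vanish because $\sum_j (x_j - \bar{x}) = \sum_j (y_j - \bar{y}) = 0$.
Substituting $v_x^2 = \sum_{j=1}^{p} (x_j - \bar{x})^2$ and $v_y^2 = \sum_{j=1}^{p} (y_j - \bar{y})^2$ and adding and
subtracting $2\sqrt{v_x^2 v_y^2} = 2v_x v_y$, we obtain

\[
\begin{aligned}
d^2(\mathbf{x}, \mathbf{y}) &= v_x^2 + v_y^2 - 2\sqrt{v_x^2 v_y^2} + p(\bar{x} - \bar{y})^2 + 2v_x v_y
 - 2\sqrt{v_x^2 v_y^2}\, \frac{\sum_{j=1}^{p} (x_j - \bar{x})(y_j - \bar{y})}{\sqrt{v_x^2 v_y^2}} \\
&= (v_x - v_y)^2 + p(\bar{x} - \bar{y})^2 + 2v_x v_y (1 - r_{xy}) .
\end{aligned}
\]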
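
The decomposition can be confirmed numerically for any pair of profiles. In this sketch the function name `profile_decomposition` and the two example vectors are made up for illustration; $v_x$, $v_y$, and $r_{xy}$ are computed exactly as defined in the derivation (sums of squares about the mean, not divided by $p$).

```python
import math


def profile_decomposition(x, y):
    """Return (direct squared distance, decomposed squared distance) for two profiles."""
    p = len(x)
    x_bar, y_bar = sum(x) / p, sum(y) / p
    v_x = math.sqrt(sum((a - x_bar) ** 2 for a in x))
    v_y = math.sqrt(sum((b - y_bar) ** 2 for b in y))
    r_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (v_x * v_y)
    d2_direct = sum((a - b) ** 2 for a, b in zip(x, y))
    d2_decomposed = (v_x - v_y) ** 2 + p * (x_bar - y_bar) ** 2 + 2 * v_x * v_y * (1 - r_xy)
    return d2_direct, d2_decomposed


# Hypothetical profiles, chosen only to exercise the identity.
direct, decomposed = profile_decomposition([2.0, 5.0, 1.0, 7.0], [3.0, 3.0, 4.0, 9.0])
print(direct, decomposed)
```

The three terms separate differences in variation, level, and shape, which is why two profiles with identical shape ($r_{xy} = 1$) can still be far apart.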


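14.2  (a)  Since $\bar{\mathbf{y}}_{AB} = \sum_{i=1}^{n_{AB}} \mathbf{y}_i / n_{AB}$, we have by (14.16),

\[
\begin{aligned}
\mathrm{SSE}_{AB} &= \sum_{i=1}^{n_{AB}} (\mathbf{y}_i - \bar{\mathbf{y}}_{AB})'(\mathbf{y}_i - \bar{\mathbf{y}}_{AB}) \\
&= \sum_{i=1}^{n_{AB}} \mathbf{y}_i'\mathbf{y}_i - \sum_{i=1}^{n_{AB}} \mathbf{y}_i'\bar{\mathbf{y}}_{AB} - \sum_{i=1}^{n_{AB}} \bar{\mathbf{y}}_{AB}'\mathbf{y}_i + \sum_{i=1}^{n_{AB}} \bar{\mathbf{y}}_{AB}'\bar{\mathbf{y}}_{AB} \\
&= \sum_{i=1}^{n_{AB}} \mathbf{y}_i'\mathbf{y}_i - n_{AB}\bar{\mathbf{y}}_{AB}'\bar{\mathbf{y}}_{AB} - n_{AB}\bar{\mathbf{y}}_{AB}'\bar{\mathbf{y}}_{AB} + n_{AB}\bar{\mathbf{y}}_{AB}'\bar{\mathbf{y}}_{AB} \\
&= \sum_{i=1}^{n_{AB}} \mathbf{y}_i'\mathbf{y}_i - n_{AB}\bar{\mathbf{y}}_{AB}'\bar{\mathbf{y}}_{AB} .
\end{aligned}
\]

Similarly, $\mathrm{SSE}_A = \sum_{i=1}^{n_A} \mathbf{y}_i'\mathbf{y}_i - n_A\bar{\mathbf{y}}_A'\bar{\mathbf{y}}_A$ and $\mathrm{SSE}_B = \sum_{i=1}^{n_B} \mathbf{y}_i'\mathbf{y}_i - n_B\bar{\mathbf{y}}_B'\bar{\mathbf{y}}_B$. Now

\[
\begin{aligned}
n_{AB}\bar{\mathbf{y}}_{AB}'\bar{\mathbf{y}}_{AB}
&= (n_A + n_B) \left( \frac{n_A\bar{\mathbf{y}}_A + n_B\bar{\mathbf{y}}_B}{n_A + n_B} \right)' \left( \frac{n_A\bar{\mathbf{y}}_A + n_B\bar{\mathbf{y}}_B}{n_A + n_B} \right) \\
&= \frac{n_A^2\bar{\mathbf{y}}_A'\bar{\mathbf{y}}_A + n_A n_B\bar{\mathbf{y}}_A'\bar{\mathbf{y}}_B + n_A n_B\bar{\mathbf{y}}_A'\bar{\mathbf{y}}_B + n_B^2\bar{\mathbf{y}}_B'\bar{\mathbf{y}}_B}{n_A + n_B} ,
\end{aligned}
\]

\[
\begin{aligned}
\mathrm{SSE}_{AB} - (\mathrm{SSE}_A + \mathrm{SSE}_B)
&= \sum_{i=1}^{n_{AB}} \mathbf{y}_i'\mathbf{y}_i - \sum_{i=1}^{n_A} \mathbf{y}_i'\mathbf{y}_i - \sum_{i=1}^{n_B} \mathbf{y}_i'\mathbf{y}_i
 + n_A\bar{\mathbf{y}}_A'\bar{\mathbf{y}}_A + n_B\bar{\mathbf{y}}_B'\bar{\mathbf{y}}_B - n_{AB}\bar{\mathbf{y}}_{AB}'\bar{\mathbf{y}}_{AB} \\
&= n_A\bar{\mathbf{y}}_A'\bar{\mathbf{y}}_A + n_B\bar{\mathbf{y}}_B'\bar{\mathbf{y}}_B - n_{AB}\bar{\mathbf{y}}_{AB}'\bar{\mathbf{y}}_{AB} .
\end{aligned}
\]

Show that when the right side of (14.16) is expanded, it reduces to this
same expression [see Problem 14.3(b)].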

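(b)  Multiplying out the right side of (14.16), we have

\[
\begin{aligned}
& n_A\bar{\mathbf{y}}_A'\bar{\mathbf{y}}_A - n_A\bar{\mathbf{y}}_A'\bar{\mathbf{y}}_{AB} - n_A\bar{\mathbf{y}}_{AB}'\bar{\mathbf{y}}_A + n_A\bar{\mathbf{y}}_{AB}'\bar{\mathbf{y}}_{AB}
 + n_B\bar{\mathbf{y}}_B'\bar{\mathbf{y}}_B - n_B\bar{\mathbf{y}}_B'\bar{\mathbf{y}}_{AB} - n_B\bar{\mathbf{y}}_{AB}'\bar{\mathbf{y}}_B + n_B\bar{\mathbf{y}}_{AB}'\bar{\mathbf{y}}_{AB} \\
&\quad = n_A\bar{\mathbf{y}}_A'\bar{\mathbf{y}}_A + n_B\bar{\mathbf{y}}_B'\bar{\mathbf{y}}_B - 2(n_A\bar{\mathbf{y}}_A' + n_B\bar{\mathbf{y}}_B')\bar{\mathbf{y}}_{AB} + (n_A + n_B)\bar{\mathbf{y}}_{AB}'\bar{\mathbf{y}}_{AB} \\
&\quad = n_A\bar{\mathbf{y}}_A'\bar{\mathbf{y}}_A + n_B\bar{\mathbf{y}}_B'\bar{\mathbf{y}}_B - 2(n_A + n_B)\bar{\mathbf{y}}_{AB}'\bar{\mathbf{y}}_{AB} + (n_A + n_B)\bar{\mathbf{y}}_{AB}'\bar{\mathbf{y}}_{AB} \\
&\quad = n_A\bar{\mathbf{y}}_A'\bar{\mathbf{y}}_A + n_B\bar{\mathbf{y}}_B'\bar{\mathbf{y}}_B - (n_A + n_B)\bar{\mathbf{y}}_{AB}'\bar{\mathbf{y}}_{AB} .
\end{aligned}
\]

Substitute $\bar{\mathbf{y}}_{AB} = (n_A\bar{\mathbf{y}}_A + n_B\bar{\mathbf{y}}_B)/(n_A + n_B)$.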
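
The equality of the three expressions for the increase in SSE can be checked on a small example. In the sketch below the two clusters are hypothetical data, and the form used for the right side of (14.16), $n_A(\bar{\mathbf{y}}_A - \bar{\mathbf{y}}_{AB})'(\bar{\mathbf{y}}_A - \bar{\mathbf{y}}_{AB}) + n_B(\bar{\mathbf{y}}_B - \bar{\mathbf{y}}_{AB})'(\bar{\mathbf{y}}_B - \bar{\mathbf{y}}_{AB})$, is inferred from the expansion in part (b) rather than quoted from the text.

```python
def mean_vector(points):
    """Centroid of a list of equal-length observation vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def sse(points):
    """Within-cluster sum of squared deviations about the cluster centroid."""
    m = mean_vector(points)
    return sum(squared_distance(pt, m) for pt in points)


# Hypothetical clusters A and B.
cluster_a = [[1.0, 2.0], [2.0, 4.0], [3.0, 3.0]]
cluster_b = [[6.0, 7.0], [8.0, 5.0]]
merged = cluster_a + cluster_b
n_a, n_b = len(cluster_a), len(cluster_b)
y_a, y_b, y_ab = mean_vector(cluster_a), mean_vector(cluster_b), mean_vector(merged)

# Increase in SSE from merging, computed three ways.
increase = sse(merged) - (sse(cluster_a) + sse(cluster_b))
via_centroids = n_a * squared_distance(y_a, y_ab) + n_b * squared_distance(y_b, y_ab)
via_means = (n_a * sum(a * a for a in y_a) + n_b * sum(b * b for b in y_b)
             - (n_a + n_b) * sum(c * c for c in y_ab))

print(increase, via_centroids, via_means)
```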

14.3  (a)  Complete linkage. From Table 14.2, we have

\[
D(C, AB) = \tfrac{1}{2} D(C, A) + \tfrac{1}{2} D(C, B) + \tfrac{1}{2} |D(C, A) - D(C, B)| . \tag{1}
\]

If $D(C, A) > D(C, B)$, then $|D(C, A) - D(C, B)| = D(C, A) - D(C, B)$,
and equation (1) becomes $D(C, AB) = D(C, A)$. If $D(C, A) < D(C, B)$,
then $|D(C, A) - D(C, B)| = D(C, B) - D(C, A)$, and equation (1)
becomes $D(C, AB) = D(C, B)$. Thus equation (1) can be written as
$D(C, AB) = \max[D(C, A), D(C, B)]$, which is equivalent to (14.9), the
definition of distance for the complete linkage method.

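(b)  Average linkage. From Table 14.2, we have

\[
D(C, AB) = \frac{n_A}{n_A + n_B} D(C, A) + \frac{n_B}{n_A + n_B} D(C, B) . \tag{2}
\]

By (14.10), equation (2) can be written as

\[
\begin{aligned}
D(C, AB) &= \frac{n_A}{n_A + n_B} \cdot \frac{1}{n_C n_A} \sum_{i=1}^{n_C} \sum_{j=1}^{n_A} d(\mathbf{y}_i, \mathbf{y}_j)
 + \frac{n_B}{n_A + n_B} \cdot \frac{1}{n_C n_B} \sum_{i=1}^{n_C} \sum_{j=1}^{n_B} d(\mathbf{y}_i, \mathbf{y}_j) \\
&= \frac{1}{n_C (n_A + n_B)} \sum_{i=1}^{n_C} \Bigl[ \sum_{j=1}^{n_A} d(\mathbf{y}_i, \mathbf{y}_j) + \sum_{j=1}^{n_B} d(\mathbf{y}_i, \mathbf{y}_j) \Bigr] \\
&= \frac{1}{n_C n_{AB}} \sum_{i=1}^{n_C} \sum_{j=1}^{n_{AB}} d(\mathbf{y}_i, \mathbf{y}_j) ,
\end{aligned}
\]

which, by (14.10), is the definition of distance for the average linkage
method.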
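
Both update formulas can be compared with the distances computed directly from the merged cluster. The following sketch uses Euclidean distance and three small hypothetical clusters; the function names `complete_link` and `average_link` are illustrative.

```python
from itertools import product


def dist(u, v):
    """Euclidean distance between two observation vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


def complete_link(c1, c2):
    """Largest distance between a point of c1 and a point of c2."""
    return max(dist(u, v) for u, v in product(c1, c2))


def average_link(c1, c2):
    """Average of all n1 * n2 distances between points of c1 and points of c2."""
    return sum(dist(u, v) for u, v in product(c1, c2)) / (len(c1) * len(c2))


# Hypothetical clusters.
A = [[0.0, 0.0], [1.0, 0.5]]
B = [[4.0, 1.0], [5.0, 2.0], [4.5, 0.0]]
C = [[2.0, 6.0], [3.0, 7.0]]
AB = A + B
n_a, n_b = len(A), len(B)

# Complete linkage: update formula (1) versus the direct maximum.
d_ca, d_cb = complete_link(C, A), complete_link(C, B)
complete_update = 0.5 * d_ca + 0.5 * d_cb + 0.5 * abs(d_ca - d_cb)
complete_direct = complete_link(C, AB)

# Average linkage: update formula (2) versus the direct average.
average_update = (n_a * average_link(C, A) + n_b * average_link(C, B)) / (n_a + n_b)
average_direct = average_link(C, AB)

print(complete_update, complete_direct, average_update, average_direct)
```

The practical value of the update formulas is that a merge needs only the previous cluster-to-cluster distances and the cluster sizes, not the original observations.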

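(c)  Substitute $\bar{\mathbf{y}}_{AB} = (n_A\bar{\mathbf{y}}_A + n_B\bar{\mathbf{y}}_B)/(n_A + n_B)$ in the left side of (14.40) in
the statement of Problem 14.3(c) and multiply to obtain

\[
\bar{\mathbf{y}}_C'\bar{\mathbf{y}}_C - \frac{2n_A\bar{\mathbf{y}}_A'\bar{\mathbf{y}}_C}{n_A + n_B} + \frac{2n_A n_B\bar{\mathbf{y}}_A'\bar{\mathbf{y}}_B}{(n_A + n_B)^2} - \frac{2n_B\bar{\mathbf{y}}_B'\bar{\mathbf{y}}_C}{n_A + n_B}
 + \frac{n_A^2\bar{\mathbf{y}}_A'\bar{\mathbf{y}}_A}{(n_A + n_B)^2} + \frac{n_B^2\bar{\mathbf{y}}_B'\bar{\mathbf{y}}_B}{(n_A + n_B)^2} .
\]

Similarly, multiply on the right side of (14.40) to obtain the same result.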

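(d)  Using $n_A = n_B$ in $\bar{\mathbf{y}}_{AB} = (n_A\bar{\mathbf{y}}_A + n_B\bar{\mathbf{y}}_B)/(n_A + n_B)$ in (14.12), we obtain
$\mathbf{m}_{AB} = \tfrac{1}{2}(\bar{\mathbf{y}}_A + \bar{\mathbf{y}}_B)$ in (14.13). Then (14.40) [see part (c)] becomes

\[
(\bar{\mathbf{y}}_C - \mathbf{m}_{AB})'(\bar{\mathbf{y}}_C - \mathbf{m}_{AB}) = \tfrac{1}{2}(\bar{\mathbf{y}}_C - \bar{\mathbf{y}}_A)'(\bar{\mathbf{y}}_C - \bar{\mathbf{y}}_A) + \tfrac{1}{2}(\bar{\mathbf{y}}_C - \bar{\mathbf{y}}_B)'(\bar{\mathbf{y}}_C - \bar{\mathbf{y}}_B)
 - \tfrac{1}{4}(\bar{\mathbf{y}}_A - \bar{\mathbf{y}}_B)'(\bar{\mathbf{y}}_A - \bar{\mathbf{y}}_B) ,
\]

which matches the parameter values for the median method in Table 14.2.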

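(e)  By (14.19),

\[
(\bar{\mathbf{y}}_A - \bar{\mathbf{y}}_B)'(\bar{\mathbf{y}}_A - \bar{\mathbf{y}}_B) = \frac{n_A + n_B}{n_A n_B} I_{AB} ,
\]

and we have analogous expressions for $(\bar{\mathbf{y}}_C - \bar{\mathbf{y}}_{AB})'(\bar{\mathbf{y}}_C - \bar{\mathbf{y}}_{AB})$,
$(\bar{\mathbf{y}}_C - \bar{\mathbf{y}}_A)'(\bar{\mathbf{y}}_C - \bar{\mathbf{y}}_A)$, and $(\bar{\mathbf{y}}_C - \bar{\mathbf{y}}_B)'(\bar{\mathbf{y}}_C - \bar{\mathbf{y}}_B)$. Then (14.40) in part (c) becomes

\[
\begin{aligned}
\frac{n_C + n_{AB}}{n_C n_{AB}} I_{C(AB)}
&= \left( \frac{n_A}{n_A + n_B} \right) \left( \frac{n_C + n_A}{n_C n_A} \right) I_{AC}
 + \left( \frac{n_B}{n_A + n_B} \right) \left( \frac{n_C + n_B}{n_C n_B} \right) I_{BC}
 - \frac{n_A n_B}{(n_A + n_B)^2} \left( \frac{n_A + n_B}{n_A n_B} \right) I_{AB} \\
&= \frac{n_A + n_C}{n_C n_{AB}} I_{AC} + \frac{n_B + n_C}{n_C n_{AB}} I_{BC} - \frac{1}{n_{AB}} I_{AB} .
\end{aligned}
\]

Solve for $I_{C(AB)}$.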

14.4  If $\gamma = 0$, then (14.20) becomes

\[
D(C, AB) = \alpha_A D(C, A) + \alpha_B D(C, B) + \beta D(A, B) . \tag{1}
\]

By (14.25), we have $D(A, C) > D(A, B)$ and $D(B, C) > D(A, B)$. Thus,
replacing $D(C, A)$ and $D(C, B)$ in equation (1) by $D(A, B)$, we obtain

\[
D(C, AB) > \alpha_A D(A, B) + \alpha_B D(A, B) + \beta D(A, B) ,
\]

which is equivalent to (14.26).

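14.5  (a)
\[
\begin{aligned}
\bar{\mathbf{v}}_{..} &= \frac{1}{gn} \sum_{i=1}^{g} \sum_{j=1}^{n} \mathbf{v}_{ij}
 = \frac{1}{gn} \sum_{i=1}^{g} \sum_{j=1}^{n} (\mathbf{A}\mathbf{y}_{ij} + \mathbf{b})
 = \frac{1}{gn} \Bigl( \mathbf{A} \sum_{ij} \mathbf{y}_{ij} + gn\mathbf{b} \Bigr) \\
&= \mathbf{A} \Bigl( \frac{1}{gn} \sum_{ij} \mathbf{y}_{ij} \Bigr) + \mathbf{b} = \mathbf{A}\bar{\mathbf{y}}_{..} + \mathbf{b} .
\end{aligned}
\]

Show similarly that $\bar{\mathbf{v}}_{i.} = \mathbf{A}\bar{\mathbf{y}}_{i.} + \mathbf{b}$. Then by (6.9), we have

\[
\begin{aligned}
\mathbf{H}_v &= n \sum_{i=1}^{g} (\bar{\mathbf{v}}_{i.} - \bar{\mathbf{v}}_{..})(\bar{\mathbf{v}}_{i.} - \bar{\mathbf{v}}_{..})' \\
&= n \sum_{i=1}^{g} [\mathbf{A}\bar{\mathbf{y}}_{i.} + \mathbf{b} - (\mathbf{A}\bar{\mathbf{y}}_{..} + \mathbf{b})][\mathbf{A}\bar{\mathbf{y}}_{i.} + \mathbf{b} - (\mathbf{A}\bar{\mathbf{y}}_{..} + \mathbf{b})]' \\
&= n \sum_{i} (\mathbf{A}\bar{\mathbf{y}}_{i.} - \mathbf{A}\bar{\mathbf{y}}_{..})(\mathbf{A}\bar{\mathbf{y}}_{i.} - \mathbf{A}\bar{\mathbf{y}}_{..})' \\
&= n \sum_{i} \mathbf{A}(\bar{\mathbf{y}}_{i.} - \bar{\mathbf{y}}_{..})(\bar{\mathbf{y}}_{i.} - \bar{\mathbf{y}}_{..})'\mathbf{A}' && \text{[by (2.27)]} \\
&= n \mathbf{A} \Bigl[ \sum_{i} (\bar{\mathbf{y}}_{i.} - \bar{\mathbf{y}}_{..})(\bar{\mathbf{y}}_{i.} - \bar{\mathbf{y}}_{..})' \Bigr] \mathbf{A}' && \text{[by (2.45)]} \\
&= \mathbf{A}\mathbf{H}_y\mathbf{A}' .
\end{aligned}
\]

Show similarly that $\mathbf{E}_v = \mathbf{A}\mathbf{E}_y\mathbf{A}'$.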
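(b)  $\operatorname{tr}(\mathbf{E}_v) = \operatorname{tr}(\mathbf{A}\mathbf{E}_y\mathbf{A}') = \operatorname{tr}(\mathbf{A}'\mathbf{A}\mathbf{E}_y) \ne \operatorname{tr}(\mathbf{E}_y)$

(c)  $|\mathbf{E}_v| = |\mathbf{A}\mathbf{E}_y\mathbf{A}'| = |\mathbf{A}||\mathbf{E}_y||\mathbf{A}'| = |\mathbf{A}|^2|\mathbf{E}_y| = c|\mathbf{E}_y|$, where $c > 0$. Thus
minimizing $|\mathbf{E}_v|$ is equivalent to minimizing $|\mathbf{E}_y|$.

(d)
\[
\begin{aligned}
\operatorname{tr}(\mathbf{E}_v^{-1}\mathbf{H}_v) &= \operatorname{tr}[(\mathbf{A}\mathbf{E}_y\mathbf{A}')^{-1}(\mathbf{A}\mathbf{H}_y\mathbf{A}')] \\
&= \operatorname{tr}[(\mathbf{A}')^{-1}\mathbf{E}_y^{-1}\mathbf{A}^{-1}\mathbf{A}\mathbf{H}_y\mathbf{A}'] \\
&= \operatorname{tr}[(\mathbf{A}')^{-1}\mathbf{E}_y^{-1}\mathbf{H}_y\mathbf{A}'] \\
&= \operatorname{tr}[\mathbf{A}'(\mathbf{A}')^{-1}\mathbf{E}_y^{-1}\mathbf{H}_y] \\
&= \operatorname{tr}(\mathbf{E}_y^{-1}\mathbf{H}_y)
\end{aligned}
\]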
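
Parts (b) through (d) can be illustrated with a small numerical example. The 2 × 2 matrices below are hypothetical stand-ins for $\mathbf{E}_y$, $\mathbf{H}_y$, and a nonsingular $\mathbf{A}$; the sketch uses the relations $\mathbf{E}_v = \mathbf{A}\mathbf{E}_y\mathbf{A}'$ and $\mathbf{H}_v = \mathbf{A}\mathbf{H}_y\mathbf{A}'$ from part (a).

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def transpose(m):
    return [[m[j][i] for j in range(2)] for i in range(2)]


def det(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def inverse(m):
    d = det(m)
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]


def trace(m):
    return m[0][0] + m[1][1]


# Hypothetical within (E_y) and between (H_y) matrices and a nonsingular A.
E_y = [[4.0, 1.0], [1.0, 3.0]]
H_y = [[2.0, -1.0], [-1.0, 5.0]]
A = [[2.0, 1.0], [0.5, 3.0]]

E_v = matmul(matmul(A, E_y), transpose(A))
H_v = matmul(matmul(A, H_y), transpose(A))

trace_changes = abs(trace(E_v) - trace(E_y)) > 1e-9   # (b): tr(E) is not invariant
det_ratio = det(E_v) / det(E_y)                       # (c): equals |A|^2
criterion_y = trace(matmul(inverse(E_y), H_y))        # (d): tr(E^{-1}H) before
criterion_v = trace(matmul(inverse(E_v), H_v))        # (d): tr(E^{-1}H) after

print(trace_changes, det_ratio, det(A) ** 2, criterion_y, criterion_v)
```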

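14.6  There are $p$ parameters in each $\boldsymbol{\mu}_i$, $\tfrac{1}{2}p(p + 1)$ unique parameters in each $\boldsymbol{\Sigma}_i$,
and $g - 1$ unique parameters $\alpha_i$. Thus the total number is

\[
\begin{aligned}
gp + g[\tfrac{1}{2}p(p + 1)] + g - 1 &= g[p + \tfrac{1}{2}p(p + 1) + 1] - 1 \\
&= \tfrac{1}{2}g[2p + p^2 + p + 2] - 1 \\
&= \tfrac{1}{2}g(3p + p^2 + 2) - 1 \\
&= \tfrac{1}{2}g(p + 1)(p + 2) - 1 .
\end{aligned}
\]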
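
As a quick check of the algebra, the term-by-term count and the closed form can be compared for a range of values. The function names below are illustrative.

```python
def parameter_count(g: int, p: int) -> int:
    """Count parameters term by term: means, unique covariance elements, and g - 1 alphas."""
    means = g * p
    covariances = g * (p * (p + 1) // 2)
    alphas = g - 1
    return means + covariances + alphas


def closed_form(g: int, p: int) -> int:
    """Closed form g(p + 1)(p + 2)/2 - 1; (p + 1)(p + 2) is always even."""
    return g * (p + 1) * (p + 2) // 2 - 1


for g in range(1, 6):
    for p in range(1, 7):
        assert parameter_count(g, p) == closed_form(g, p)

print(parameter_count(2, 3), closed_form(2, 3))
```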

14.7  (a)  The two-cluster solution from single linkage puts boy No. 20 in one cluster
and the other 19 boys in the other cluster.

(b), (c), and (d).  Based on the change in distance, average linkage and the
other cluster solutions in parts (c) and (d) clearly indicate two clusters.
These solutions generally agree and also correspond to a division into two
groups seen in the first principal component in Figure 12.5. The separation
of the three apparent outliers from the other 17 observations is less
pronounced in the cluster analyses than in Figure 12.5. Note that the scale of
the second component in Figure 12.5 is much larger than that of the first
component, so the separation of points 9, 12, and 20 from the rest is not
as large as it appears in the figure. Of the methods in parts (b), (c), and
(d), only flexible beta with $\beta = -.50$ and $-.75$ places points 9, 12, and 20
together in one cluster. All others place 9 and 12 in one of the clusters and
20 in the other.

14.8  (a)  The distance between centroids of the two clusters is $\sqrt{2994.9} = 54.7$.

(b)  From  the  dendrogram  produced  by  the  average  linkage  method,  the  largest 
change  in  distance  corresponds  to  a  two-cluster  solution. 

(c)  The  discriminant  function  completely  separates  the  two  clusters,  with  no 
overlap. 

14.9 (a) Observation 22 seems to be an outlier, because it forms its own cluster in
both the single linkage and average linkage methods. The cluster consisting
of observations 2, 21, 24, 26, and 30 is the same in all six methods.

(b)  The  discriminant  function  completely  separates  the  two  clusters,  with  no 
overlap. 
14.10 (a) The following five clusters were found using as seeds the five observations
that are mutually farthest apart.

Cluster   Observation(s)
1         9, 15, 16, 18, 19
2         1, 2, 3, 4, 5, 17
3         6, 7, 8, 20
4         10, 11, 12, 13
5         14

In  the  plot  of  the  first  two  discriminant  functions,  observation  14  is 
relatively  far  removed  from  the  rest.  Clusters  1,  2,  and  3  are  somewhat 
closer  to  each  other. 

(b) The following five clusters were found using as seeds the first five
observations.

Cluster   Observation(s)
1         1, 3, 4
2         2
3         5, 17, 18, 19
4         6, 7, 8, 15, 16, 20
5         9, 10, 11, 12, 13, 14

The  plot  of  the  first  two  discriminant  functions  shows  a  pattern  different 
from  that  in  part  (a). 

(c)  The  following  five  clusters  were  found  using  as  seeds  the  centroids  of  the 
five-cluster  solution  resulting  from  Ward’s  method. 

Cluster   Observation(s)
1         6, 7, 8, 15, 16, 20
2         5, 9, 17, 18, 19
3         10, 11, 12, 13
4         1, 2, 3, 4
5         14

The plot of the first two discriminant functions shows a pattern similar
to that found in part (a), with observation 14 isolated. The dendrogram
shows that Ward's method gives the same five-cluster solution as the
k-means result.

(d) The following five clusters were found using the k-means method with
seeds equal to the centroids of the five clusters from average linkage.

Cluster   Observation(s)
1         6, 7, 8, 15, 16, 20
2         1, 2, 3, 4, 5, 17, 18, 19
3         10, 11, 12, 13
4         9
5         14

The plot of the first two discriminant functions shows a pattern
somewhat similar to that in part (a).
In the dendrogram for average linkage, observations 9 and 14 are
isolated clusters in the five-cluster solution, which is identical to the
five-cluster solution using k-means clustering with these seeds.

(e) Observation 14 does not appear as an outlier in the plot of the first two
principal components, but it does show up as an outlier in the plot of
the second and third components. The solutions found in parts (a) and (c)
seem to agree most with the principal component plots. This suggests that
a different number of initial cluster seeds be used.

(f)  The  two  clustering  solutions  are  identical.  The  results  are  given  next. 


Cluster   Observations
1         6, 7, 8, 15, 16, 20
2         9, 10, 11, 12, 13, 14
3         1, 2, 3, 4, 5, 17, 18, 19

(g)  The  clustering  solution  is  identical  to  that  found  in  part  (f),  which  indicates 
that  the  three-cluster  solution  is  appropriate. 

14.11 The numbers of clusters obtained from the indicated combinations of k and r
are shown in the following table. Note that for each pair of values of k and
r, the value of r was increased if necessary for each point until k points were
included in the sphere.

k/r    .2   .4   .6   .8  1.0  1.2  1.4  1.6  1.8  2.0
2      10   10   10   10    8    6    4    3    3    2
3       5    5    5    5    5    3    2    2    2    2
4       2    2    2    2    2    2    2    2    2    2
5       1    1    1    1    1    1    1    1    1    1

The maximum value of k that yields a two-cluster solution is 4.

14.12 (a) The numbers of clusters obtained from the initial combinations of k and r
are shown in the following table. The value of r was variable, as noted in
Problem 14.11.

k/r    .2   .4   .6   .8  1.0  1.2  1.4  1.6  1.8  2.0
2       3    3    3    3    3    3    3    3    3    3
3       2    2    2    2    2    2    2    2    2    2
4       1    1    1    1    1    1    1    1    1    1

(b) The plot of the first two discriminant functions for k = 2 and r = 1 shows
the three clusters to be well separated.

(c)  The  plot  of  the  first  two  principal  components  shows  the  same  groupings 
as  in  the  plot  in  part  (b). 
(d) The plot of the discriminant function shows wide separation of the two
clusters, which do not overlap. The three-cluster solution found in part (b)
is given next.

Cluster 1      Cluster 2           Cluster 3
Harpers        Rosemaund           Cambridge
Morley         Terrington          Cockle Park
Myerscough     Headley             Sparsholt
Seale-Hayne    Sutton Bonington    Wye

The  two-cluster  solution  found  in  part  (d)  merges  clusters  2  and  3  of 
part  (b). 


CHAPTER  15 
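15.1
\[
\mathbf{B} = \left(\mathbf{I} - \tfrac{1}{n}\mathbf{J}\right)\mathbf{A}\left(\mathbf{I} - \tfrac{1}{n}\mathbf{J}\right)
= \mathbf{A} - \tfrac{1}{n}\mathbf{A}\mathbf{J} - \tfrac{1}{n}\mathbf{J}\mathbf{A} + \tfrac{1}{n^2}\mathbf{J}\mathbf{A}\mathbf{J}. \qquad (1)
\]

By (2.38),

\[
\tfrac{1}{n}\mathbf{A}\mathbf{j} = \tfrac{1}{n}
\begin{pmatrix} \sum_j a_{1j} \\ \sum_j a_{2j} \\ \vdots \\ \sum_j a_{nj} \end{pmatrix}
= \begin{pmatrix} \bar{a}_{1.} \\ \bar{a}_{2.} \\ \vdots \\ \bar{a}_{n.} \end{pmatrix}. \qquad (2)
\]

Hence,

\[
\tfrac{1}{n}\mathbf{A}\mathbf{J} = \tfrac{1}{n}\mathbf{A}(\mathbf{j}, \mathbf{j}, \ldots, \mathbf{j})
= \left(\tfrac{1}{n}\mathbf{A}\mathbf{j}, \ldots, \tfrac{1}{n}\mathbf{A}\mathbf{j}\right)
= \begin{pmatrix}
\bar{a}_{1.} & \cdots & \bar{a}_{1.} \\
\bar{a}_{2.} & \cdots & \bar{a}_{2.} \\
\vdots & & \vdots \\
\bar{a}_{n.} & \cdots & \bar{a}_{n.}
\end{pmatrix}.
\]

Show that

\[
\tfrac{1}{n}\mathbf{J}\mathbf{A} =
\begin{pmatrix}
\bar{a}_{.1} & \bar{a}_{.2} & \cdots & \bar{a}_{.n} \\
\bar{a}_{.1} & \bar{a}_{.2} & \cdots & \bar{a}_{.n} \\
\vdots & \vdots & & \vdots \\
\bar{a}_{.1} & \bar{a}_{.2} & \cdots & \bar{a}_{.n}
\end{pmatrix}.
\]

Using equation (2), we obtain

\[
\tfrac{1}{n^2}\mathbf{j}'\mathbf{A}\mathbf{j} = \tfrac{1}{n^2}(1, 1, \ldots, 1)
\begin{pmatrix} \sum_j a_{1j} \\ \vdots \\ \sum_j a_{nj} \end{pmatrix}
= \tfrac{1}{n^2}\sum_{ij} a_{ij} = \bar{a}_{..}.
\]

By (3.63),

\[
\tfrac{1}{n^2}\mathbf{J}\mathbf{A}\mathbf{J} = \tfrac{1}{n^2}
\begin{pmatrix}
\mathbf{j}'\mathbf{A}\mathbf{j} & \cdots & \mathbf{j}'\mathbf{A}\mathbf{j} \\
\vdots & & \vdots \\
\mathbf{j}'\mathbf{A}\mathbf{j} & \cdots & \mathbf{j}'\mathbf{A}\mathbf{j}
\end{pmatrix}
= \begin{pmatrix}
\bar{a}_{..} & \cdots & \bar{a}_{..} \\
\vdots & & \vdots \\
\bar{a}_{..} & \cdots & \bar{a}_{..}
\end{pmatrix}.
\]

Hence the $ij$th element of equation (1) is $b_{ij} = a_{ij} - \bar{a}_{i.} - \bar{a}_{.j} + \bar{a}_{..}$.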
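
The elementwise form $b_{ij} = a_{ij} - \bar{a}_{i.} - \bar{a}_{.j} + \bar{a}_{..}$ can be tried on a small matrix. The following is a minimal sketch in plain Python; the function name `double_center` and the 3 x 3 matrix are hypothetical, chosen only for illustration. Because $\mathbf{B}$ is $\mathbf{A}$ pre- and post-multiplied by the centering matrix, every row and column of the result should sum to zero.

```python
def double_center(A):
    """Return B with b_ij = a_ij - abar_i. - abar_.j + abar_.. for a square matrix A."""
    n = len(A)
    row_means = [sum(row) / n for row in A]                              # abar_i.
    col_means = [sum(A[i][j] for i in range(n)) / n for j in range(n)]   # abar_.j
    grand_mean = sum(row_means) / n                                      # abar_..
    return [
        [A[i][j] - row_means[i] - col_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


# Hypothetical symmetric matrix with zero diagonal (illustration only).
A = [
    [0.0, -2.0, -4.5],
    [-2.0, 0.0, -0.5],
    [-4.5, -0.5, 0.0],
]
B = double_center(A)

# Row and column sums of B vanish up to floating-point rounding.
row_sums = [sum(row) for row in B]
col_sums = [sum(B[i][j] for i in range(3)) for j in range(3)]
print(row_sums, col_sums)
```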

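15.2 (a) (Seber 1984, pp. 236-237) The elements of $\mathbf{B} = (b_{ij})$ are defined as
$b_{ij} = a_{ij} - \bar{a}_{i.} - \bar{a}_{.j} + \bar{a}_{..}$, where $a_{ij} = -\tfrac{1}{2}\delta_{ij}^2$. Thus

\[
-2a_{ij} = \delta_{ij}^2 = (\mathbf{z}_i - \mathbf{z}_j)'(\mathbf{z}_i - \mathbf{z}_j)
= \mathbf{z}_i'\mathbf{z}_i + \mathbf{z}_j'\mathbf{z}_j - 2\mathbf{z}_i'\mathbf{z}_j.
\]

Then

\[
-2\bar{a}_{i.} = \frac{1}{n}\sum_{j=1}^{n}(-2a_{ij})
= \frac{1}{n}\sum_j (\mathbf{z}_i'\mathbf{z}_i + \mathbf{z}_j'\mathbf{z}_j - 2\mathbf{z}_i'\mathbf{z}_j)
= \mathbf{z}_i'\mathbf{z}_i + \frac{1}{n}\sum_j \mathbf{z}_j'\mathbf{z}_j - 2\mathbf{z}_i'\bar{\mathbf{z}}.
\]

Similarly, show that

\[
-2\bar{a}_{.j} = \mathbf{z}_j'\mathbf{z}_j + \frac{1}{n}\sum_i \mathbf{z}_i'\mathbf{z}_i - 2\bar{\mathbf{z}}'\mathbf{z}_j,
\qquad
-2\bar{a}_{..} = \frac{2}{n}\sum_i \mathbf{z}_i'\mathbf{z}_i - 2\bar{\mathbf{z}}'\bar{\mathbf{z}}.
\]

Solve for $\bar{a}_{i.}$, $\bar{a}_{.j}$, and $\bar{a}_{..}$ and substitute into $b_{ij} = a_{ij} - \bar{a}_{i.} - \bar{a}_{.j} + \bar{a}_{..}$
to obtain $b_{ij} = \mathbf{z}_i'\mathbf{z}_j - \mathbf{z}_i'\bar{\mathbf{z}} - \bar{\mathbf{z}}'\mathbf{z}_j + \bar{\mathbf{z}}'\bar{\mathbf{z}}$, which can be factored as
$b_{ij} = (\mathbf{z}_i - \bar{\mathbf{z}})'(\mathbf{z}_j - \bar{\mathbf{z}})$. Hence

\[
\mathbf{B} =
\begin{pmatrix}
(\mathbf{z}_1 - \bar{\mathbf{z}})'(\mathbf{z}_1 - \bar{\mathbf{z}}) & \cdots & (\mathbf{z}_1 - \bar{\mathbf{z}})'(\mathbf{z}_n - \bar{\mathbf{z}}) \\
\vdots & & \vdots \\
(\mathbf{z}_n - \bar{\mathbf{z}})'(\mathbf{z}_1 - \bar{\mathbf{z}}) & \cdots & (\mathbf{z}_n - \bar{\mathbf{z}})'(\mathbf{z}_n - \bar{\mathbf{z}})
\end{pmatrix}
=
\begin{pmatrix} (\mathbf{z}_1 - \bar{\mathbf{z}})' \\ \vdots \\ (\mathbf{z}_n - \bar{\mathbf{z}})' \end{pmatrix}
(\mathbf{z}_1 - \bar{\mathbf{z}}, \ldots, \mathbf{z}_n - \bar{\mathbf{z}})
= \mathbf{Z}_c\mathbf{Z}_c' \quad \text{[see (10.13)]}.
\]

Thus $\mathbf{B}$ is positive semidefinite (see Section 2.7).

(b) If $\mathbf{B}$ is positive semidefinite of rank $q$, then by (2.109) and Section 2.11.4,
$\mathbf{B}$ can be expressed in the form $\mathbf{B} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}'$, where $\mathbf{V} = (\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_n)$
is an orthogonal matrix of eigenvectors of $\mathbf{B}$, and $\boldsymbol{\Lambda}$ is a diagonal matrix
of eigenvalues, $q$ of which are positive, with the rest equal to zero. Letting
$\boldsymbol{\Lambda}_1$ be the $q \times q$ upper-left-hand block of $\boldsymbol{\Lambda}$ with positive eigenvalues
and $\mathbf{V}_1 = (\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_q)$ be the $n \times q$ matrix with the corresponding
eigenvectors, we can write $\mathbf{B} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}'$ as

\[
\begin{aligned}
\mathbf{B} &= (\mathbf{V}_1, \mathbf{V}_2)
\begin{pmatrix} \boldsymbol{\Lambda}_1 & \mathbf{O} \\ \mathbf{O} & \mathbf{O} \end{pmatrix}
\begin{pmatrix} \mathbf{V}_1' \\ \mathbf{V}_2' \end{pmatrix} \\
&= \mathbf{V}_1\boldsymbol{\Lambda}_1\mathbf{V}_1' = \mathbf{V}_1\boldsymbol{\Lambda}_1^{1/2}\boldsymbol{\Lambda}_1^{1/2}\mathbf{V}_1' \\
&= \mathbf{Z}\mathbf{Z}', \qquad (1)
\end{aligned}
\]

where the $n \times q$ matrix $\mathbf{Z}$ is

\[
\mathbf{Z} = \mathbf{V}_1\boldsymbol{\Lambda}_1^{1/2} = (\sqrt{\lambda_1}\,\mathbf{v}_1, \sqrt{\lambda_2}\,\mathbf{v}_2, \ldots, \sqrt{\lambda_q}\,\mathbf{v}_q)
= \begin{pmatrix} \mathbf{z}_1' \\ \mathbf{z}_2' \\ \vdots \\ \mathbf{z}_n' \end{pmatrix}.
\]

To show that $(\mathbf{z}_i - \mathbf{z}_j)'(\mathbf{z}_i - \mathbf{z}_j)$ is equal to $\delta_{ij}^2$, we can proceed as follows:

\[
(\mathbf{z}_i - \mathbf{z}_j)'(\mathbf{z}_i - \mathbf{z}_j) = \mathbf{z}_i'\mathbf{z}_i + \mathbf{z}_j'\mathbf{z}_j - 2\mathbf{z}_i'\mathbf{z}_j. \qquad (2)
\]

By equation (1), we have

\[
\mathbf{B} = \mathbf{Z}\mathbf{Z}' =
\begin{pmatrix} \mathbf{z}_1' \\ \mathbf{z}_2' \\ \vdots \\ \mathbf{z}_n' \end{pmatrix}
(\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_n)
= \begin{pmatrix}
\mathbf{z}_1'\mathbf{z}_1 & \mathbf{z}_1'\mathbf{z}_2 & \cdots & \mathbf{z}_1'\mathbf{z}_n \\
\mathbf{z}_2'\mathbf{z}_1 & \mathbf{z}_2'\mathbf{z}_2 & \cdots & \mathbf{z}_2'\mathbf{z}_n \\
\vdots & \vdots & & \vdots \\
\mathbf{z}_n'\mathbf{z}_1 & \mathbf{z}_n'\mathbf{z}_2 & \cdots & \mathbf{z}_n'\mathbf{z}_n
\end{pmatrix}.
\]

Hence equation (2) becomes

\[
(\mathbf{z}_i - \mathbf{z}_j)'(\mathbf{z}_i - \mathbf{z}_j)
= \mathbf{z}_i'\mathbf{z}_i + \mathbf{z}_j'\mathbf{z}_j - 2\mathbf{z}_i'\mathbf{z}_j
= b_{ii} + b_{jj} - 2b_{ij}. \qquad (3)
\]

Show that substituting $b_{ij} = a_{ij} - \bar{a}_{i.} - \bar{a}_{.j} + \bar{a}_{..}$ into equation (3) leads to

\[
(\mathbf{z}_i - \mathbf{z}_j)'(\mathbf{z}_i - \mathbf{z}_j) = a_{ii} + a_{jj} - 2a_{ij} + \bar{a}_{i.} - \bar{a}_{.i} + \bar{a}_{.j} - \bar{a}_{j.}.
\]

Show that the symmetry of $\mathbf{A}$ implies $\bar{a}_{i.} = \bar{a}_{.i}$ and $\bar{a}_{.j} = \bar{a}_{j.}$. Hence,

\[
(\mathbf{z}_i - \mathbf{z}_j)'(\mathbf{z}_i - \mathbf{z}_j) = a_{ii} + a_{jj} - 2a_{ij} = -2a_{ij} = \delta_{ij}^2,
\]

since $a_{ii} = -\tfrac{1}{2}\delta_{ii}^2 = 0$ and $-2a_{ij} = \delta_{ij}^2$.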
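
Both parts of Problem 15.2 can be checked on a toy configuration. The sketch below is an illustration under stated assumptions, not the book's computation: the four planar points and all function names are hypothetical. It builds $a_{ij} = -\tfrac{1}{2}\delta_{ij}^2$, double-centers it, and then compares $\mathbf{B}$ with the centered inner products $(\mathbf{z}_i - \bar{\mathbf{z}})'(\mathbf{z}_j - \bar{\mathbf{z}})$ and $b_{ii} + b_{jj} - 2b_{ij}$ with $\delta_{ij}^2$.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def double_center(A):
    """b_ij = a_ij - abar_i. - abar_.j + abar_.. for a square matrix A."""
    n = len(A)
    row_means = [sum(row) / n for row in A]
    col_means = [sum(A[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_means) / n
    return [
        [A[i][j] - row_means[i] - col_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


# Hypothetical configuration of n = 4 points in the plane.
Z = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0), (1.0, 1.0)]
n = len(Z)

delta2 = [[squared_distance(Z[i], Z[j]) for j in range(n)] for i in range(n)]
A = [[-0.5 * d for d in row] for row in delta2]   # a_ij = -(1/2) delta_ij^2
B = double_center(A)

# Part (a): B should equal the matrix of centered inner products Z_c Z_c'.
zbar = [sum(z[k] for z in Z) / n for k in range(2)]
centered = [[z[k] - zbar[k] for k in range(2)] for z in Z]
gram = [
    [sum(a * b for a, b in zip(centered[i], centered[j])) for j in range(n)]
    for i in range(n)
]
max_gram_error = max(abs(B[i][j] - gram[i][j]) for i in range(n) for j in range(n))

# Part (b): b_ii + b_jj - 2 b_ij should recover delta_ij^2.
max_distance_error = max(
    abs(B[i][i] + B[j][j] - 2 * B[i][j] - delta2[i][j])
    for i in range(n)
    for j in range(n)
)
print(max_gram_error, max_distance_error)
```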



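15.3 (a)
\[
\begin{aligned}
\mathbf{r} = \sum_{j=1}^{b} p_{.j}\mathbf{c}_j
&= \sum_{j=1}^{b} p_{.j}\left(\frac{p_{1j}}{p_{.j}}, \frac{p_{2j}}{p_{.j}}, \ldots, \frac{p_{aj}}{p_{.j}}\right)' \\
&= \sum_{j=1}^{b} (p_{1j}, p_{2j}, \ldots, p_{aj})' \quad \text{[by (2.61)]} \\
&= \Big(\sum_j p_{1j}, \sum_j p_{2j}, \ldots, \sum_j p_{aj}\Big)' \\
&= (p_{1.}, p_{2.}, \ldots, p_{a.})'
\end{aligned}
\]

(b)
\[
\begin{aligned}
\mathbf{c}' = \sum_{i=1}^{a} p_{i.}\mathbf{r}_i'
&= \sum_{i=1}^{a} p_{i.}\left(\frac{p_{i1}}{p_{i.}}, \frac{p_{i2}}{p_{i.}}, \ldots, \frac{p_{ib}}{p_{i.}}\right) \\
&= \sum_{i=1}^{a} (p_{i1}, p_{i2}, \ldots, p_{ib}) \quad \text{[by (2.61)]} \\
&= \Big(\sum_i p_{i1}, \sum_i p_{i2}, \ldots, \sum_i p_{ib}\Big) \\
&= (p_{.1}, p_{.2}, \ldots, p_{.b})
\end{aligned}
\]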

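15.4 $\mathbf{j}'\mathbf{r} = \sum_{i=1}^{a} p_{i.} = \sum_{i=1}^{a}\sum_{j=1}^{b} p_{ij} = \sum_{ij} n_{ij}/n = n/n = 1$.
$\mathbf{c}'\mathbf{j} = \sum_{j=1}^{b} p_{.j} = \sum_j n_{.j}/n = \sum_j \sum_i n_{ij}/n = n/n = 1$.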
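
The identities in Problems 15.3 and 15.4 are easy to confirm on a small table. The sketch below assumes a hypothetical 2 x 3 table of counts (not data from the book) and checks that the weighted sum of column profiles, $\sum_j p_{.j}\mathbf{c}_j$, reproduces the row masses and that both sets of masses sum to 1.

```python
# Hypothetical 2 x 3 contingency table of counts (illustration only).
N = [
    [10, 20, 30],
    [5, 15, 20],
]
n = sum(sum(row) for row in N)
a, b = len(N), len(N[0])

P = [[N[i][j] / n for j in range(b)] for i in range(a)]               # p_ij = n_ij / n
row_mass = [sum(P[i]) for i in range(a)]                              # p_i.
col_mass = [sum(P[i][j] for i in range(a)) for j in range(b)]         # p_.j

# Column profiles c_j = (p_1j / p_.j, ..., p_aj / p_.j)'.
col_profiles = [[P[i][j] / col_mass[j] for i in range(a)] for j in range(b)]

# Problem 15.3(a): r = sum_j p_.j c_j should equal the vector of row masses.
r = [sum(col_mass[j] * col_profiles[j][i] for j in range(b)) for i in range(a)]

# Problem 15.4: j'r = 1 and c'j = 1.
print(r, sum(r), sum(col_mass))
```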
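15.5 By (15.8), (15.9), and (15.10), $p_{ij} = n_{ij}/n$, $p_{i.} = n_{i.}/n$, and $p_{.j} = n_{.j}/n$.
Substituting these into (15.25), we obtain

\[
\begin{aligned}
\chi^2 &= n\sum_{ij} \frac{\left(\dfrac{n_{ij}}{n} - \dfrac{n_{i.}n_{.j}}{n^2}\right)^2}{\dfrac{n_{i.}n_{.j}}{n^2}} \\
&= \sum_{ij} \frac{n\,\dfrac{1}{n^2}\left(n_{ij} - \dfrac{n_{i.}n_{.j}}{n}\right)^2}{\dfrac{n_{i.}n_{.j}}{n^2}} \\
&= \sum_{ij} \frac{\left(n_{ij} - \dfrac{n_{i.}n_{.j}}{n}\right)^2}{\dfrac{n_{i.}n_{.j}}{n}}.
\end{aligned}
\]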
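
As a quick numerical check of Problem 15.5, the sketch below computes the statistic twice for a hypothetical 2 x 3 table of counts: once from proportions, assuming (15.25) has the form $n\sum_{ij}(p_{ij} - p_{i.}p_{.j})^2/(p_{i.}p_{.j})$, and once from counts in the familiar observed-minus-expected form. The table and variable names are illustrative assumptions.

```python
# Hypothetical 2 x 3 contingency table of counts (illustration only).
N = [
    [10, 20, 30],
    [5, 15, 20],
]
n = sum(sum(row) for row in N)
a, b = len(N), len(N[0])
row_tot = [sum(N[i]) for i in range(a)]                         # n_i.
col_tot = [sum(N[i][j] for i in range(a)) for j in range(b)]    # n_.j

# Form based on proportions: n * sum (p_ij - p_i. p_.j)^2 / (p_i. p_.j).
chi2_from_p = n * sum(
    (N[i][j] / n - (row_tot[i] / n) * (col_tot[j] / n)) ** 2
    / ((row_tot[i] / n) * (col_tot[j] / n))
    for i in range(a)
    for j in range(b)
)

# Form based on counts: sum (n_ij - n_i. n_.j / n)^2 / (n_i. n_.j / n).
chi2_from_n = sum(
    (N[i][j] - row_tot[i] * col_tot[j] / n) ** 2 / (row_tot[i] * col_tot[j] / n)
    for i in range(a)
    for j in range(b)
)

print(chi2_from_p, chi2_from_n)  # the two values agree up to rounding
```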

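15.6 (a) Multiplying numerator and denominator of (15.25) by $p_{i.}$, we obtain

\[
\begin{aligned}
\chi^2 &= n\sum_i\sum_j \frac{p_{i.}(p_{ij} - p_{i.}p_{.j})^2}{p_{i.}^2\,p_{.j}} \\
&= \sum_i n p_{i.} \sum_j \frac{1}{p_{.j}}\left(\frac{p_{ij} - p_{i.}p_{.j}}{p_{i.}}\right)^2 \\
&= \sum_i n p_{i.} \sum_j \frac{1}{p_{.j}}\left(\frac{p_{ij}}{p_{i.}} - p_{.j}\right)^2.
\end{aligned}
\]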

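15.7 (a) By (15.29), (15.10), (15.12), and (15.18), we obtain

\[
\begin{aligned}
\chi^2 &= \sum_i n p_{i.}(\mathbf{r}_i - \mathbf{c})'\mathbf{D}_c^{-1}(\mathbf{r}_i - \mathbf{c}) \\
&= \sum_i n p_{i.}\left(\frac{p_{i1}}{p_{i.}} - p_{.1}, \ldots, \frac{p_{ib}}{p_{i.}} - p_{.b}\right)
\begin{pmatrix} p_{.1} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & p_{.b} \end{pmatrix}^{-1}
\begin{pmatrix} \dfrac{p_{i1}}{p_{i.}} - p_{.1} \\ \vdots \\ \dfrac{p_{ib}}{p_{i.}} - p_{.b} \end{pmatrix} \\
&= \sum_i n p_{i.}\left(\frac{\dfrac{p_{i1}}{p_{i.}} - p_{.1}}{p_{.1}}, \ldots, \frac{\dfrac{p_{ib}}{p_{i.}} - p_{.b}}{p_{.b}}\right)
\begin{pmatrix} \dfrac{p_{i1}}{p_{i.}} - p_{.1} \\ \vdots \\ \dfrac{p_{ib}}{p_{i.}} - p_{.b} \end{pmatrix}
\end{aligned}
\]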


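15.8 (a) By (15.9), $\mathbf{r} = \mathbf{P}\mathbf{j}$. Then $\mathbf{D}_r^{-1}\mathbf{r} = \mathbf{D}_r^{-1}\mathbf{P}\mathbf{j} = \mathbf{R}\mathbf{j}$ by (15.15). By (15.13),
$\mathbf{r}_i'\mathbf{j} = 1$, and therefore $\mathbf{R}\mathbf{j} = \mathbf{j}$. Now

\[
\mathbf{D}_r^{-1}(\mathbf{P} - \mathbf{r}\mathbf{c}') = \mathbf{D}_r^{-1}\mathbf{P} - \mathbf{D}_r^{-1}\mathbf{r}\mathbf{c}' = \mathbf{R} - \mathbf{R}\mathbf{j}\mathbf{c}' = \mathbf{R} - \mathbf{j}\mathbf{c}'.
\]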
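
The identity in Problem 15.8(a) says that scaling the rows of $\mathbf{P} - \mathbf{r}\mathbf{c}'$ by the inverse row masses is the same as subtracting the column masses from each row profile. A minimal sketch, assuming a hypothetical 2 x 3 table of counts and illustrative variable names:

```python
# Hypothetical 2 x 3 contingency table of counts (illustration only).
N = [
    [10, 20, 30],
    [5, 15, 20],
]
n = sum(sum(row) for row in N)
a, b = len(N), len(N[0])

P = [[N[i][j] / n for j in range(b)] for i in range(a)]
r = [sum(P[i]) for i in range(a)]                          # row masses, r = Pj
c = [sum(P[i][j] for i in range(a)) for j in range(b)]     # column masses
R = [[P[i][j] / r[i] for j in range(b)] for i in range(a)] # row profiles, R = D_r^{-1} P

# Left side: D_r^{-1}(P - rc'), i.e. row i of (P - rc') divided by r_i.
left = [[(P[i][j] - r[i] * c[j]) / r[i] for j in range(b)] for i in range(a)]

# Right side: R - jc', i.e. c' subtracted from every row profile.
right = [[R[i][j] - c[j] for j in range(b)] for i in range(a)]

max_gap = max(abs(left[i][j] - right[i][j]) for i in range(a) for j in range(b))
row_profile_sums = [sum(R[i]) for i in range(a)]           # each should be 1 (Rj = j)
print(max_gap, row_profile_sums)
```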

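15.9 By (15.49), $\mathbf{z}_i' = \mathbf{y}_i'\mathbf{A}$ (ignoring the centering on $\mathbf{y}_i$). Thus the squared
Euclidean distance can be written as

\[
\begin{aligned}
(\mathbf{z}_i - \mathbf{z}_k)'(\mathbf{z}_i - \mathbf{z}_k) &= (\mathbf{z}_i' - \mathbf{z}_k')(\mathbf{z}_i - \mathbf{z}_k) \\
&= (\mathbf{y}_i'\mathbf{A} - \mathbf{y}_k'\mathbf{A})(\mathbf{A}'\mathbf{y}_i - \mathbf{A}'\mathbf{y}_k) \\
&= (\mathbf{y}_i - \mathbf{y}_k)'\mathbf{A}\mathbf{A}'(\mathbf{y}_i - \mathbf{y}_k) \\
&= (\mathbf{y}_i - \mathbf{y}_k)'(\mathbf{y}_i - \mathbf{y}_k),
\end{aligned}
\]

since $\mathbf{A}$ is orthogonal.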
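
A two-dimensional rotation gives a concrete instance of Problem 15.9. In the sketch below, the points, the angle, and the helper names are hypothetical; $\mathbf{A}$ is taken to be a rotation matrix, which is one particular orthogonal matrix, and the squared distance is compared before and after the transformation $\mathbf{z}' = \mathbf{y}'\mathbf{A}$.

```python
import math


def transform(y, theta):
    """Return z with z' = y'A, where A = [[cos, -sin], [sin, cos]] is orthogonal."""
    c, s = math.cos(theta), math.sin(theta)
    return (y[0] * c + y[1] * s, -y[0] * s + y[1] * c)


def squared_distance(u, v):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


# Hypothetical observation vectors and rotation angle (illustration only).
y_i, y_k = (2.0, -1.0), (-0.5, 3.0)
theta = 0.7

z_i, z_k = transform(y_i, theta), transform(y_k, theta)

before = squared_distance(y_i, y_k)
after = squared_distance(z_i, z_k)
print(before, after)  # equal up to floating-point rounding
```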

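15.10 (a) From $\mathbf{Y}_c\mathbf{V} = \mathbf{U}\boldsymbol{\Lambda}$ in (15.55), we have $\mathbf{Y}_c\mathbf{V}\boldsymbol{\Lambda}^{-1} = \mathbf{U}$. Then

\[
\mathbf{U}\mathbf{U}' = \mathbf{Y}_c\mathbf{V}\boldsymbol{\Lambda}^{-1}\boldsymbol{\Lambda}^{-1}\mathbf{V}'\mathbf{Y}_c'
= \mathbf{Y}_c\mathbf{V}(\boldsymbol{\Lambda}^{-1})^2\mathbf{V}'\mathbf{Y}_c'. \qquad (1)
\]

Since $(\boldsymbol{\Lambda}^{-1})^2 = \operatorname{diag}(1/\lambda_1^2, 1/\lambda_2^2, \ldots, 1/\lambda_p^2)$, where the $\lambda_i^2$'s are
eigenvalues of $\mathbf{Y}_c'\mathbf{Y}_c$, the matrix $(\boldsymbol{\Lambda}^{-1})^2$ contains eigenvalues of $(\mathbf{Y}_c'\mathbf{Y}_c)^{-1} =
[(n-1)\mathbf{S}]^{-1} = \mathbf{S}^{-1}/(n-1)$ [see (2.115) and (2.116)]. The matrix $\mathbf{V}$
contains eigenvectors of $\mathbf{Y}_c'\mathbf{Y}_c$ and thereby of $(\mathbf{Y}_c'\mathbf{Y}_c)^{-1}$ (see Section
2.11.9). Hence we recognize $\mathbf{V}(\boldsymbol{\Lambda}^{-1})^2\mathbf{V}'$ as the spectral decomposition of
$(\mathbf{Y}_c'\mathbf{Y}_c)^{-1}$ [see (2.109), (2.115), and (2.116)]. Therefore, equation (1) can
be written as

\[
\mathbf{U}\mathbf{U}' = \mathbf{Y}_c\mathbf{V}(\boldsymbol{\Lambda}^{-1})^2\mathbf{V}'\mathbf{Y}_c' = \mathbf{Y}_c(\mathbf{Y}_c'\mathbf{Y}_c)^{-1}\mathbf{Y}_c'
= \mathbf{Y}_c\mathbf{S}^{-1}\mathbf{Y}_c'/(n-1).
\]

(b) If $\mathbf{H} = \mathbf{V}\boldsymbol{\Lambda}$, then $\mathbf{H}\mathbf{H}' = \mathbf{V}\boldsymbol{\Lambda}\boldsymbol{\Lambda}\mathbf{V}' = \mathbf{V}\boldsymbol{\Lambda}^2\mathbf{V}'$. The diagonal matrix $\boldsymbol{\Lambda}^2$
contains the eigenvalues $\lambda_i^2$ of the matrix $\mathbf{Y}_c'\mathbf{Y}_c$. Thus by (2.115), $\mathbf{V}\boldsymbol{\Lambda}^2\mathbf{V}'$
is the spectral decomposition of $\mathbf{Y}_c'\mathbf{Y}_c$, and

\[
\mathbf{H}\mathbf{H}' = \mathbf{V}\boldsymbol{\Lambda}^2\mathbf{V}' = \mathbf{Y}_c'\mathbf{Y}_c = (n-1)\mathbf{S}.
\]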


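15.11 By (15.64), (3.63), and (3.64) (ignoring $n - 1$ and assuming the $\mathbf{y}_i$'s are
centered),

\[
\begin{aligned}
(\mathbf{u}_i - \mathbf{u}_k)'(\mathbf{u}_i - \mathbf{u}_k) &= \mathbf{u}_i'\mathbf{u}_i + \mathbf{u}_k'\mathbf{u}_k - 2\mathbf{u}_i'\mathbf{u}_k \\
&= \mathbf{y}_i'\mathbf{S}^{-1}\mathbf{y}_i + \mathbf{y}_k'\mathbf{S}^{-1}\mathbf{y}_k - 2\mathbf{y}_i'\mathbf{S}^{-1}\mathbf{y}_k \\
&= (\mathbf{y}_i - \mathbf{y}_k)'\mathbf{S}^{-1}(\mathbf{y}_i - \mathbf{y}_k).
\end{aligned}
\]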


15.12 (a) The first ten rows and columns of the matrix B are as follows.

 129849  -26801  -88750  -53847  -59118   43583  -73877   81571  112101   80909
 -26801    2310   17029   11125   14394  -11076   18149  -18662  -21852  -16306
 -88750   17029   65973   32378   31044  -31085   68156  -56671  -79481  -54135
 -53847   11125   32378   27683   31808  -18550   30003  -34154  -46882  -32096
 -59118   14394   31044   31808   38141  -19161   27269  -37147  -51673  -34687
  45383  -11076  -31085  -18550  -19161   14741  -33620   27054   45347   29211
 -73877   18149   68156   30003   27269  -33620   76423  -45782  -86804  -58650
  81571  -18662  -56671  -34154  -37147   27054  -45782   49169   75557   50169
 112101  -21852  -79481  -46882  -51673   45347  -86804   75557  119634   81258
  80909  -16306  -54135  -32096  -34687   29211  -58650   50169   81258   53286
(b) The first two columns of the matrix Z are given by

City       z1       z2      City       z1       z2
A       354.1    -10.2      M       391.6     47.5
B       -77.1     25.0      N        21.0    -44.7
C      -238.2    -75.7      O         9.8     30.9
D      -154.9     65.9      P      -173.8    -78.5
E      -163.2     72.2      Q         6.3     17.1
F       126.0     24.9      R       117.0    -48.0
G      -228.8   -149.4      S      -102.3   -170.2
H       223.9      1.5      T       -53.2    -27.3
I       337.7     44.8      U      -315.2    190.9
J       226.7     34.7      V      -255.7    140.2
K       -33.4     22.3      W       -19.3    -34.3
L         1.1    -79.4

(c) The metric multidimensional scaling plot shows the relative positions of
the cities.

15.13 (a) The multidimensional scaling plot shows two clusters, one for positive
values of the first dimension and one for negative values. The two clusters
can be interpreted as comfort (positive values) and discomfort (negative
values). Hence, the axis of the first dimension can be interpreted as the
level of comfort.

(b) The dendrogram for Ward's method clearly shows two clusters, the same
as in part (a).

15.14 (a) The initial configuration of points will vary. One example is as follows:

    y1       y2       y3       y4       y5       y6
  1.458     .769   -1.350     .456   -1.610    1.827
  -.598   -1.069   -2.667     .458     .416    1.094
 -1.777    -.409     .369     .655    -.058    1.177
   .071     .361    1.157    -.154     .343    -.417
  -.060    1.361     .743    1.436     .332    -.894
  -.757    -.432    -.545     .233     .646    -.102
 -1.971    -.492    -.461     .078    1.441     .039
 -1.560    -.173     .657    -.528    1.001    1.030
  -.597     .814    -.898     .283    -.355   -1.115
  1.449    -.942     .867    -.922     .833    1.196
 -1.809    -.093   -1.762    -.533   -1.136    -.226
  1.067     .199     .978     .884   -1.060    -.800

(b)  Answers  will  vary.  For  the  seeds  given  in  part  (a),  STRESS  =  .0266. 

(c)  Answers  will  vary.  The  plot  of  STRESS  versus  k  for  one  solution  showed 
that  two  dimensions  should  be  retained.  The  nonmetric  MDS  plot  showed 
that  Franco,  Mussolini,  and  Hitler  were  close  together,  as  well  as  Churchill 
and  DeGaulle,  and  Eisenhower  and  Truman. 

(d)  Answers  will  vary.  One  solution  gave  results  similar  to  part  (c). 




(e)  Answers  will  vary.  One  solution  showed  three  dimensions.  A  plot  of  two 
dimensions  showed  Mussolini  and  Franco  together  in  the  center  with  the 
others  forming  a  circle  around  them  of  almost  equally  spaced  points. 

(f)  Answers  will  vary.  One  solution  was  similar  to  that  in  part  (c). 

15.15 (a) The correspondence matrix P is found by dividing each element of Table
15.16 by n = 1281 to obtain the following:

                                        Death
Birth   Jan.  Feb.  Mar.  Apr.  May   Jun.  Jul.  Aug.  Sep.  Oct.  Nov.  Dec.  Total
Jan.    .007  .011  .009  .011  .007  .009  .008  .012  .007  .009  .009  .010   .108
Feb.    .010  .005  .005  .006  .007  .004  .003  .004  .005  .009  .001  .010   .069
Mar.    .009  .011  .007  .005  .013  .008  .007  .008  .007  .002  .010  .007   .094
Apr.    .005  .009  .008  .005  .007  .009  .003  .009  .003  .007  .006  .009   .080
May     .006  .005  .009  .005  .003  .009  .007  .007  .009  .005  .007  .003   .075
Jun.    .011  .004  .004  .005  .010  .004  .005  .003  .006  .007  .005  .004   .069
Jul.    .009  .008  .010  .003  .004  .009  .005  .005  .003  .008  .003  .006   .073
Aug.    .005  .005  .009  .010  .008  .007  .002  .006  .006  .006  .006  .009   .081
Sep.    .005  .009  .009  .008  .008  .009  .003  .006  .009  .005  .006  .005   .083
Oct.    .012  .006  .009  .007  .005  .008  .009  .006  .007  .006  .005  .005   .087
Nov.    .005  .007  .012  .008  .009  .008  .005  .008  .005  .008  .007  .005   .087
Dec.    .005  .014  .007  .009  .011  .006  .007  .007  .008  .005  .008  .006   .092
Total   .092  .094  .096  .084  .092  .088  .066  .080  .077  .075  .074  .081  1.000

(b) The R matrix is given by

R =
  .07  .11  .12  .11  .07  .04  .11  .10  .09  .08  .09  .04
  .13  .08  .12  .07  .07  .03  .09  .11  .10  .08  .08  .08
  .09  .08  .07  .15  .05  .08  .07  .08  .12  .08  .05  .08
  .09  .06  .15  .08  .15  .04  .06  .07  .10  .01  .12  .08
  .10  .11  .09  .10  .07  .07  .08  .09  .07  .08  .08  .07
  .04  .06  .09  .11  .13  .07  .12  .14  .05  .04  .11  .04
  .08  .04  .06  .06  .16  .08  .06  .06  .15  .08  .10  .09
  .06  .08  .07  .12  .10  .07  .08  .07  .14  .11  .02  .07
  .07  .09  .04  .06  .08  .09  .13  .11  .04  .09  .06  .11
  .09  .09  .05  .08  .06  .06  .09  .14  .10  .08  .09  .06
  .08  .07  .06  .07  .14  .11  .09  .10  .06  .06  .07  .08
  .09  .08  .07  .11  .07  .04  .10  .10  .09  .08  .06  .11

and the C matrix is given by

  .07  .11  .12  .09  .06  .05  .10  .08  .08  .08  .09  .04
  .12  .08  .12  .06  .04  .08  .09  .08  .08  .08  .08  .08
  .10  .09  .08  .15  .05  .11  .07  .07  .12  .11  .06  .10
  .07  .05  .13  .06  .11  .05  .04  .05  .08  .01  .11  .07
  .13  .15  .13  .12  .08  .12  .10  .10  .08  .12  .11  .09
  .04  .06  .08  .08  .10  .08  .10  .11  .04  .04  .10  .04
  .07  .04  .05  .04  .12  .08  .04  .04  .11  .07  .09  .08
  .07  .10  .09  .12  .10  .11  .09  .07  .14  .14  .02  .09
  .07  .09  .04  .05  .07  .11  .11  .09  .03  .09  .10  .07
  .08  .08  .07  .07  .14  .14  .09  .09  .06  .07  .08  .09
  .09  .08  .07  .10  .06  .05  .10  .09  .08  .08  .06  .12
(c) The chi-square statistic is 117.7742 with 121 degrees of freedom, which
gives a p-value of .5660. The two variables appear to be independent.

(d) In the correspondence plot, the following associations are seen: {November
births, June deaths}, {March deaths, April deaths, January births}, {September
births, February deaths}, {August births, April births}, {May deaths,
September deaths, May births}.

15.16 (a) The correspondence matrix P is found by dividing each element of Table
15.17 by 8193 to obtain the following:

Part of Country   Burglary   Fraud   Vandalism   Total
Oslo area           .048      .300     .215       .563
Mid Norway          .018      .019     .112       .148
North Norway        .085      .040     .164       .289
Total               .151      .358     .491      1.000

(b) The R matrix is given by

Part of Country   Burglary   Fraud   Vandalism
Oslo area           .086      .533     .381
Mid Norway          .121      .126     .753
North Norway        .293      .138     .569

and the C matrix is

Part of Country   Burglary   Fraud   Vandalism
Oslo area           .320      .837     .437
Mid Norway          .119      .052     .228
North Norway        .561      .111     .335

(c)  The  chi-square  statistic  is  1662.6  with  4  degrees  of  freedom,  which  gives 
a  p-value  less  than  .0001.  The  two  variables  are  dependent. 
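
The statistic in part (c) can be approximated from the rounded correspondence matrix in part (a). The sketch below assumes that the chi-square statistic has the form $n\sum_{ij}(p_{ij} - p_{i.}p_{.j})^2/(p_{i.}p_{.j})$ and uses the printed three-decimal proportions and margins with n = 8193. Because those entries are rounded, the result should only land near the reported 1662.6, not reproduce it exactly. Variable names are illustrative.

```python
# Rounded correspondence matrix from part (a); rows are Oslo area, Mid Norway,
# North Norway and columns are burglary, fraud, vandalism.
P = [
    [0.048, 0.300, 0.215],
    [0.018, 0.019, 0.112],
    [0.085, 0.040, 0.164],
]
row_mass = [0.563, 0.148, 0.289]   # printed row totals
col_mass = [0.151, 0.358, 0.491]   # printed column totals
n = 8193

# Assumed form of the statistic: n * sum (p_ij - p_i. p_.j)^2 / (p_i. p_.j).
chi2 = n * sum(
    (P[i][j] - row_mass[i] * col_mass[j]) ** 2 / (row_mass[i] * col_mass[j])
    for i in range(3)
    for j in range(3)
)

# Degrees of freedom for an a x b table: (a - 1)(b - 1).
df = (len(P) - 1) * (len(P[0]) - 1)

print(round(chi2, 1), df)
```

The largest contributions come from the Oslo/fraud, North Norway/burglary, and North Norway/fraud cells, which is in line with the associations described in part (d).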

(d) In the correspondence plot, North Norway is associated with burglaries,
Oslo is associated with fraud, and Mid Norway is associated with vandalism.
15.17 (a) The Burt matrix is as follows.

No            5254     0   564  3408  1282  1830  3424  2466  2788  2190  3064   686  2666  1902
Yes              0   165   105    42    18    73    92    37   128    40   125    26    63    76
High dust      564   105   669     0     0   402   267    62   607   218   451    87   359   223
Low dust      3408    42     0  3450     0  1056  2394  1642  1808  1446  2004   480  1684  1286
Medium dust   1282    18     0     0  1300   445   855   799   501   566   734   145   686   469
Race: Other   1830    73   402  1056   445  1903     0   932   971   799  1104   108  1658   137
White         3424    92   267  2394   855     0  3516  1571  1945  1431  2085   604  1071  1841
Female        2466    37    62  1642   799   932  1571  2503     0  1373  1130   266  1421   816
Male          2788   128   607  1808   501   971  1945     0  2916   857  2059   446  1308  1162
Nonsmoker     2190    40   218  1446   566   799  1431  1373   857  2230     0   231  1142   857
Smoker        3064   125   451  2004   734  1104  2085  1130  2059     0  3189   481  1587  1121
10-20          686    26    87   480   145   108   604   266   446   231   481   712     0     0
< 10          2666    63   359  1684   686  1658  1071  1421  1308  1142  1587     0  2729     0
> 20          1902    76   223  1286   469   137  1841   816  1162   857  1121     0     0  1978

(b) The column coordinates for the plot are given by

Variables         y1       y2
No              -.032    -.087
Yes             1.013    2.761
High dust       1.072    1.648
Low dust        -.209    -.107
Medium dust      .003    -.564
Race: Other     1.184    -.153
White           -.641     .083
Female           .007    -.791
Male            -.006     .679
Nonsmoker       -.036    -.592
Smoker           .025     .414
10-20           -.605     .535
< 10             .789    -.300
> 20            -.871     .221

(c)  Some  associations  seen  in  the  plot  are  {byssinosis-yes,  high  dust},  {female, 
nonsmoker,  medium  dust},  {smoker,  male}. 
15.18 (a) The two-dimensional coordinates of the observation points and variable
points are

Observation Points

Name           Coordinate 1   Coordinate 2
Albania            14.102         -1.322
Austria            -5.461          1.548
Belgium            -6.077         -1.479
Bulgaria           26.116          3.319
Czech.              3.317         -2.092
Denmark           -13.861          1.374
E. Germany         -4.902         -8.360
Finland           -12.262         11.290
France             -6.345           .672
Greece              9.036          3.033
Hungary            10.805         -2.363
Ireland           -11.857          5.312
Italy               6.309         -1.314
Netherlands       -11.809          2.133
Norway            -11.005          -.077
Poland              2.526          2.999
Portugal             .784        -16.753
Romania            19.067          2.591
Spain               1.923        -10.483
Sweden            -14.842           .726
Switzerland        -9.068          4.000
UK                 -9.311           .698
USSR               10.586          4.355
W. Germany        -13.514         -3.353
Yugoslavia         25.742          3.548

Variable Points

Name           Coordinate 1   Coordinate 2
R_MEAT              -.151           .133
W_MEAT              -.129           .043
EGGS                -.067           .021
MILK                -.425           .831
FISH                -.127          -.292
CEREALS              .861           .406
STARCHY             -.067          -.076
NUTS                 .114          -.070
FRUT_VEG             .020          -.169

In the biplot, the arrows for variables are too short to pass through the
points for observations.
(b) The two-dimensional coordinates of the observation points and variable
points are given next.

Observation Points

Name           Coordinate 1   Coordinate 2
Albania              .231          -.049
Austria             -.089           .057
Belgium             -.100          -.055
Bulgaria             .428           .122
Czech.               .054          -.077
Denmark             -.227           .051
E. Germany          -.080          -.308
Finland             -.201           .416
France              -.104           .025
Greece               .148           .112
Hungary              .177          -.087
Ireland             -.194           .196
Italy                .103          -.048
Netherlands         -.193           .079
Norway              -.180          -.003
Poland               .041           .110
Portugal             .013          -.617
Romania              .312           .095
Spain                .032          -.386
Sweden              -.243           .027
Switzerland         -.149           .147
UK                  -.153           .026
USSR                 .173           .160
W. Germany          -.221          -.124
Yugoslavia           .422           .131

Variable Points

Name           Coordinate 1   Coordinate 2
R_MEAT             -9.196          3.602
W_MEAT             -7.904          1.179
EGGS               -4.106           .569
MILK              -25.964         22.552
FISH               -7.750         -7.934
CEREALS            52.545         11.025
STARCHY            -4.080         -2.064
NUTS                6.953         -1.902
FRUT_VEG            1.235         -4.593

In the biplot, the observation points are tightly clustered around the
point (0, 0), making them difficult to distinguish, whereas variable points
are easily discerned. Red meats, white meats, and milk are highly
positively correlated. These three variables are negatively correlated with nuts
and frut_veg.

(c) The two-dimensional coordinates of the observation points and variable
points are as follows.

Observation Points

Name           Coordinate 1   Coordinate 2
Albania             1.805          -.254
Austria             -.699           .297
Belgium             -.778          -.284
Bulgaria            3.343           .637
Czech.               .425          -.402
Denmark            -1.774           .264
E. Germany          -.627         -1.605
Finland            -1.570          2.167
France              -.812           .129
Greece              1.157           .582
Hungary             1.383          -.454
Ireland            -1.518          1.020
Italy                .808          -.252
Netherlands        -1.511           .409
Norway             -1.409          -.015
Poland               .323           .576
Portugal             .100         -3.216
Romania             2.441           .497
Spain                .246         -2.012
Sweden             -1.900           .139
Switzerland        -1.161           .768
UK                 -1.192           .134
USSR                1.355           .836
W. Germany         -1.730          -.644
Yugoslavia          3.295           .681

Variable Points

Name           Coordinate 1   Coordinate 2
R_MEAT             -1.177           .691
W_MEAT             -1.012           .226
EGGS                -.526           .109
MILK               -3.323          4.329
FISH                -.992         -1.523
CEREALS             6.726          2.116
STARCHY             -.522          -.396
NUTS                 .890          -.365
FRUT_VEG             .158          -.882

In the biplot, the variable points and observation points are both well
spaced. Finland scored high on the milk variable. Yugoslavia and Bulgaria
scored high on the cereal variable. Spain and Portugal scored highest on
the fish and frut_veg variables.

(d)  The  biplot  from  part  (c)  seems  better  because  the  scales  on  the  variables 
and  points  are  more  evenly  matched. 

15.19 (a) The two-dimensional coordinates of the observation points and variable
points are as follows.

Observation Points

Name       Coordinate 1   Coordinate 2
FSM1          -9.535         -4.752
Sister         2.705           .796
FSM2           4.043          -.584
Father         4.392           .614
Teacher       -8.708          5.008
MSM            3.409           .701
FSM3           3.694         -1.782

Variable Points

Name       Coordinate 1   Coordinate 2
KIND            .610          -.054
INTEL           .085           .413
HAPPY           .407          -.456
LIKE            .621          -.039
JUST            .264           .785

In the biplot, the arrows for the variables are too short to pass through
the points for observations.

(b) The two-dimensional coordinates of the observation points and variable
points are as follows.

Observation Points

Name       Coordinate 1   Coordinate 2
FSM1           -.622          -.655
Sister          .176           .110
FSM2            .264          -.080
Father          .287           .085
Teacher        -.568           .690
MSM             .222           .097
FSM3            .241          -.246

Variable Points

Name       Coordinate 1   Coordinate 2
KIND           9.345          -.391
INTEL          1.298          2.997
HAPPY          6.235         -3.313
LIKE           9.521          -.282
JUST           4.054          5.700

In the biplot, the observation points are tightly clustered around the
point (0, 0), making them difficult to distinguish, whereas variable points
are well spaced. Just and intelligent are highly positively correlated, as are
kind and likeable.

(c) The two-dimensional coordinates of the observation points and variable
points are as follows.

Observation Points

Name       Coordinate 1   Coordinate 2
FSM1          -2.435         -1.764
Sister          .691           .295
FSM2           1.033          -.217
Father         1.122           .228
Teacher       -2.224          1.859
MSM             .871           .260
FSM3            .943          -.662

Variable Points

Name       Coordinate 1   Coordinate 2
KIND           2.387          -.145
INTEL           .331          1.113
HAPPY          1.593         -1.230
LIKE           2.432          -.105
JUST           1.036          2.116

In the biplot, the variable points and observation points are both well
spaced. Father, sister, FSM2, and FSM3 all scored high on the kind,
likeable, and happy variables, whereas teacher and FSM1 scored negatively
on those variables.

(d)  The  biplot  from  part  (c)  seems  better  because  the  scales  on  the  variables 
and  points  are  more  evenly  matched. 
15.20 (a) The two-dimensional coordinates of the observation points and variable
points are as follows:

Observation Points

Name   Coordinate 1   Coordinate 2
1         49.410         -5.832
2         25.407         -7.658
3         21.600         -2.340
4        -23.545         -6.367
5        -28.477         -4.773
6        -33.341          2.315
7        -28.176          7.992
8        -25.786         12.655
9        -29.703          9.275
10       -33.868         -3.776
11       -33.529         -1.977
12        28.186        -16.031
13        10.804         -6.608
14          .566          3.021
15        77.970           .109
16        12.859         16.294
17        41.960          5.103
18        46.930         19.064
19        34.958         -1.018
20       -16.477          1.148
21       -23.634         -1.055
22       -34.036         -2.424
23        20.632         -5.882
24       -15.873         -6.731
25       -23.023           .745
26       -15.183         -1.942
27       -11.903         -6.917
28         5.273          3.610

Variable Points

Name   Coordinate 1   Coordinate 2
North       .526           .225
East        .429           .752
South       .579          -.379
West        .452          -.490

In the biplot, the variable points are tightly grouped and the
corresponding arrows do not pass through the points for observations.
(b) The two-dimensional coordinates of the observation points and variable
points are as follows:

Observation Points

Name   Coordinate 1   Coordinate 2
1           .303          -.145
2           .156          -.191
3           .132          -.058
4          -.144          -.158
5          -.175          -.119
6          -.205           .058
7          -.173           .199
8          -.158           .315
9          -.182           .231
10         -.208          -.094
11         -.206          -.049
12          .173          -.399
13          .066          -.164
14          .003           .075
15          .478           .003
16          .079           .406
17          .257           .127
18          .288           .474
19          .214          -.025
20         -.101           .029
21         -.145          -.026
22         -.209          -.060
23          .127          -.146
24         -.097          -.168
25         -.141           .019
26         -.093          -.048
27         -.073          -.172
28          .032           .090

Variable Points

Name   Coordinate 1   Coordinate 2
North     85.779          9.026
East      69.899         30.223
South     94.377        -15.213
West      73.682        -19.694

In the biplot, the observation points are tightly clustered around the
point (0, 0), making them difficult to distinguish, whereas variable points
are well spaced. All the variables are positively correlated, with south and
west showing the closest relationship.
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(c)  The  two-dimensional  coordinates  of  the  observation  points  and  variable 
points  are  as  follows: 


Observation Points

Name   Coordinate 1   Coordinate 2
 1        3.870          -.920
 2        1.990         -1.208
 3        1.692          -.369
 4       -1.844         -1.004
 5       -2.230          -.753
 6       -2.611           .365
 7       -2.207          1.261
 8       -2.020          1.996
 9       -2.326          1.463
10       -2.652          -.596
11       -2.626          -.312
12        2.207         -2.529
13         .846         -1.042
14         .044           .477
15        6.106           .017
16        1.007          2.571
17        3.286           .805
18        3.675          3.008
19        2.738          -.161
20       -1.290           .181
21       -1.851          -.166
22       -2.666          -.382
23        1.616          -.928
24       -1.243         -1.062
25       -1.803           .118
26       -1.189          -.306
27        -.932         -1.091
28         .413           .569

Variable Points

Name     Coordinate 1   Coordinate 2
North       6.718          1.424
East        5.474          4.768
South       7.391         -2.400
West        5.771         -3.107

In  the  biplot,  the  variable  points  and  observation  points  are  both  well 
spaced.  Tree  18  is  associated  with  east,  17  with  north,  1  and  3  with  south, 
and  2  and  23  with  west. 

(d)  The  biplot  from  part  (c)  seems  better  because  the  scales  on  the  variables 
and  points  are  more  evenly  matched. 
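
The three coordinate sets in parts (a)-(c) appear to differ only in how the singular values are shared between the observation points and the variable points; the printed values are consistent with that reading, but it is an inference, not something stated in the answer. The following is a minimal Python sketch under that assumption. The function name biplot_coordinates, the parameter alpha, and the random data matrix Y are hypothetical and stand in for the actual data set, which is not reproduced here.

```python
import numpy as np

def biplot_coordinates(Y, alpha):
    """Return 2-D observation and variable points for a biplot.

    alpha = 1.0 -> observations U*L, variables V      (as part (a) appears to do)
    alpha = 0.0 -> observations U,   variables V*L    (as part (b) appears to do)
    alpha = 0.5 -> both sets scaled by sqrt(L)        (as part (c) appears to do)
    """
    Yc = Y - Y.mean(axis=0)                      # center each column
    U, s, Vt = np.linalg.svd(Yc, full_matrices=False)
    U2, s2, V2 = U[:, :2], s[:2], Vt[:2].T       # keep the two largest singular values
    obs = U2 * s2 ** alpha                       # points for observations
    var = V2 * s2 ** (1.0 - alpha)               # points for variables
    return obs, var

# Hypothetical stand-in data: 28 observations on 4 variables.
rng = np.random.default_rng(0)
Y = rng.normal(size=(28, 4))

obs_a, var_a = biplot_coordinates(Y, 1.0)
obs_b, var_b = biplot_coordinates(Y, 0.0)
obs_c, var_c = biplot_coordinates(Y, 0.5)
```

All three choices reproduce the same rank-2 approximation of the centered data (the product of the observation points and the transposed variable points is unchanged), so the choice of alpha affects only how evenly the two sets of points are spread in the plot, which is the comparison made in part (d).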


APPENDIX  C 


Data  Sets  and  SAS  Files 


Two  sets  of  files  are  located  on  the  ftp  server  of  John  Wiley  &  Sons  STM  (Scientific, 
Technical,  and  Medical)  Division. 

All  data  sets  in  the  book, 

SAS  command  files  for  all  numerical  examples. 


DOWNLOADING  DATA  SET  FILES  FROM  FTP  SERVER 

Please  read  the  message  in  the  multivariate  analysis  area  before  downloading  the 
files.  The  message  has  information  about  the  files  and  discusses  how  to  access  them. 

The  files  can  be  accessed  through  either  a  standard  ftp  program  or  a  Web  browser 
using  the  ftp  protocol.  To  gain  ftp  access,  type  the  following  at  your  ftp  command 
prompt: 


ftp://ftp.wiley.com 

The files are located in the multivariate analysis area of the public/sci_tech_med
directory. If you need further information about downloading the files, you can reach
Wiley’s technical support line at 212-850-6194.
Index


Additional  information,  test  for,  136-139, 
231-233 
Air  pollution  data,  502 
Airline  distance  data,  508 
Algebra,  matrix,  see  Matrix,  algebra 
Analysis  of  variance: 

multivariate  (MANOVA): 

additional  information,  test  for,  231-233 
association,  measures  of,  173-176 
assumptions,  checking  on,  198-199 
and  canonical  correlation,  376-377 
contrasts,  180-183 
discriminant  function,  165,  184-185, 

191 

growth  curves,  221-230.  See  also  Growth 
curves;  Repeated  measures 
designs 

H and E matrices, 160-161
higher  order  models,  195-196 
individual  variables,  tests  on,  163-164, 
183-186 

discriminant  function,  184-185,  191 
experimentwise  error  rate,  183-185 
protected tests, 184
Lawley-Hotelling  test,  167 
table  of  critical  values,  524-528 
likelihood  ratio  test,  164 
mixed  models,  196-198 

expected  mean  squares,  196-197 
multivariate  association,  measures  of, 
173-176 
one-way,  158-161 
contrasts,  180-183 
orthogonal,  181 
model,  159 
unbalanced,  168 
Pillai’s  test,  166 


profile  analysis,  199-201 
repeated  measures,  204-221.  See  also 
Repeated  measures  designs; 
Growth  curves 

Roy’s  test  (union-intersection),  164-166 
table  of  critical  values,  517-520 
stepwise  discriminant  analysis,  233 
stepwise  selection  of  variables,  233 
test  for  additional  information,  231-233 
test  statistics,  161-173 
comparison  of,  169-170,  176-178 
and  eigenvalues,  168 
power  of,  176-178 
and T², 169

tests  on  individual  variables,  163-174, 
183-186,  191 

discriminant  function,  165,  184-185,  191 
experimentwise  error  rate,  183-185 
protected tests, 184
test  on  a  subvector,  231-233 
two-way,  188-195 
contrasts,  190-191 
discriminant  function,  191 
interactions,  189-190 
main  effects,  189-190 
model,  189 

tests  on  individual  variables,  191 
test  statistics,  190 
unbalanced  one-way,  168 
union-intersection  test,  164 
Wilks’ Λ (likelihood ratio) test, 161-164
chi-square approximation, 162
F approximation, 162
partial Λ-statistic, 232
properties  of,  162-164 
table  of  critical  values,  501-516 
transformations  to  exact  F,  162-163 
Analysis of variance (cont.)
univariate  (ANOVA): 
one-way,  156-158 
contrasts,  178-180 
orthogonal,  179-180 
SSH,  SSE,  F-statistic,  158 
two-way,  186-188 
contrasts,  188 
F-test,  188 
interaction,  187 
main  effects,  187-188 
model,  186 

ANOVA,  see  Analysis  of  variance,  univariate 
Association,  measures  of,  173-176,  349-351 
Athletic  record  data,  480 

Bar  steel  data,  192 
Beetles  data,  150 
Bilinear  form,  19-20 
Biplots,  531-539 
coordinates  of  points,  533-534 
correlation,  534 
cosine,  534,  537 

points  for  observations,  531-534 
points  for  variables,  531-534 
principal  component  approach,  531-532,  535 
singular  value  decomposition,  532-533,  535 
Birth  and  death  data,  543 
Bivariate  normal  distribution,  46,  84,  88-89,  133 
Blood  data,  237 
Blood  pressure  data,  245 
Bonferroni  critical  values,  127 
table,  562-565 
Box’s  M-test,  257-259 
table  of  exact  critical  values,  588-589 
Bronchus  data,  154 
Burt  matrix,  526-529 
Byssinosis  data,  545-546 

Calcium  data,  56 
Calculator  speed  data,  210 
Canonical  correlation(s),  174,  260,  361-378 
canonical  variates,  see  Canonical  variates 
definition  of,  362-364 
and  discriminant  analysis,  376-378 
and  eigenvalues,  362-363,  377-378 
with  grouping  variables,  174 
and  MANOVA,  376-378 
dummy  variables,  376-377 
and  measures  of  association,  362,  373-374 
and  multiple  correlation,  361-362,  366,  376 
properties  of,  366-367 
redundancy  analysis,  373-374 
and  regression,  368-369,  374-376 


subset  selection,  376 

with  test  for  independence  of  two  subvectors, 
260,  367-368 

tests of significance, 367-371
all  canonical  correlations,  367-369 
comparison  of  tests,  368-369 
and  test  of  independence,  367-368 
and  test  of  overall  regression,  367-368, 
375 

subset  of  canonical  correlations,  369-371 
subset  selection,  376 
test  of  a  subset  in  regression,  375-376 
Canonical  variates: 

correlations  among,  364 
definition  of,  363 
interpretation,  371-374 
by  correlations  (structure  coefficients), 

373 

by  rotation,  373 

by  standardized  coefficients,  371-373 
redundancy  analysis,  373-374 
and  regression,  374-376 
standardized  coefficients,  365,  371-373 
Categorical  variables,  see  Dummy  variables 
Central  Limit  Theorem  (Multivariate),  91 
Characteristic  form: 
of t-statistic, 117, 122
of T²-statistic, 118, 123
Characteristic  roots,  see  Eigenvalues 
Chemical  data,  340 

Chi-square  distribution,  86,  91-92,  114 
Cholesky  decomposition,  25-26 
City  crime  data,  456 

Classification  analysis  (allocation),  299-321 
assigning  a  sampling  unit  to  a  group,  299 
asymptotic  optimality,  302 
correct  classification  rates,  307-309 
error rates, 307-313. See also Error rate(s)
estimates  of,  307-313 
as  a  stopping  rule,  311-313 
k-nearest neighbor rule, 318-319
nonparametric  classification  procedures,  302, 
314-320 

density  estimators  (kernel),  315-317 
multinomial  data  (categorical  variables), 
314-315 

dummy  variables,  315 
nearest  neighbor  rule,  318-320 
k-nearest neighbor rule, 318-319
several  groups,  304-307 

linear  classification  functions,  304-306 
equal  covariance  matrices,  304-305 
optimal  classification  rule  (Welch),  305 
prior  probabilities,  305-307 
quadratic classification functions, 306-307
unequal  covariance  matrices,  306 
subset  selection,  311-313 

stepwise  discriminant  analysis,  311-313 
error  rate  as  a  stopping  rule,  311-313 
two  groups,  300-303 

Fisher’s  classification  function,  300-302 
linear  classification  function,  301-302 
optimal  classification  rule  (Welch),  302 
prior  probabilities,  302 
Cluster  analysis,  451-503 
average  linkage  method,  463 
centroid method, 463-465
choosing the number of clusters, 494-496
and classification, 451
clustering observations, 451-496
clustering variables, 451, 497-499
comparison of methods, 478-479
complete linkage method, 459-462
definition, 451
dendrogram, 456
crossover, 471

examples of, 458-459, 461-462, 464-465,
467, 469, 472-473, 476-477
inversion, 471
reversal, 471
dissimilarity,  452 
distance, 451-454
distance  matrix,  453 
Euclidean  distance,  452 
Minkowski  metric,  453 
profile  of  observation  vector: 
level,  454 
shape,  454 
variation,  454 

scale  of  measurement,  453-454 
statistical  distance,  452-453 
farthest  neighbor  method,  see  complete 
linkage  method 
flexible beta method, 468-471
hierarchical clustering, 452, 455-481
agglomerative  method,  455-479 
average  linkage,  463 
centroid, 463-465
mean vectors, 463
complete linkage, 459-462
flexible beta, 468-471
median, 466
single linkage, 456-459
Ward’s method, 466-468
comparison of methods, 478-479
dendrogram,  456 
divisive  method,  455,  479-481 
monothetic,  479 


polythetic, 479-480
properties,  471-479 
chaining,  474 
contraction,  474 
dilation,  474 
monotonicity, 471
outliers, 478-479
space  contracting,  474 
space  dilating,  474 
ultrametric, 471

incremental  sum  of  squares  method,  see 
Ward’s  method 
median  method,  466 

nearest  neighbor  method,  see  single  linkage 
method 

nonhierarchical  methods,  481-494 
density  estimation,  493 
dense  point,  493 
modes,  493 

mixtures  of  distributions,  490-492 
partitioning, 481-490
k-means, 482-488
seeds, 482-487

methods  based  on  E  and  H,  488-490 
number  of  clusters: 

choosing the number of clusters, 494-496
cutting the dendrogram, 494-495
methods based on E and H, 495-496
total  possible  number,  455 
optimization  methods,  see  nonhierarchical 
methods,  partitioning 
partitioning, 452, 481-490
plotting  of  clusters: 

discriminant functions, 486-488, 494
principal components, 451, 484
projection pursuit, 451
profile  of  observation  vector: 
level,  454 
shape,  454 
variation,  454 
similarity,  451-455 
single linkage method, 456-459
tree  diagram,  see  dendrogram 
validity  of  a  cluster  solution,  496 
cross  validation,  496 
hypothesis  test,  496 
variables, clustering of, 451, 497-499
correlations,  497 
and  factor  analysis,  498 
Ward’s method, 466-468
Coated  pipe  data,  135 
Coefficient of determination, see R²
Commensurate  variables,  see  Variables, 
commensurate 
Communality, see Factor analysis
Confidence  interval  (reference),  119,  127 
Contingency  table: 

graphical  analysis  of,  see  Correspondence 
analysis 

higher-way  table,  526,  528-529 
two-way  table,  514-516,  519,  521 
Contour  plots,  84-85 
Contrast(s): 

contrast  matrices  in  growth  curves,  222-225, 
227-230 

contrast  matrices  in  repeated  measures,  206, 
208-221 

one-sample  profile  analysis,  141-142 
one-way  ANOVA,  178-180 
one-way  MANOVA,  180-183 
orthonormal,  206 
two-way  ANOVA,  188 
two-way  MANOVA,  190-191 
Cork  data,  239 

Correct  classification  rate,  307-309 
Correlation: 

canonical,  see  Canonical  correlation(s) 
and  cosine  of  angle  between  two  vectors, 
49-50 

intra-class  correlation,  198-199 
and  law  of  cosines,  49-50 
multiple,  see  Multiple  correlation 
and  orthogonality  of  two  vectors,  50 
population  correlation  (p),  49 
sample  correlation  (r),  49 
of  two  linear  combinations,  67,  72-73 
Correlation  matrix: 

and  covariance  matrix,  61 
factor analysis on, 418-419
partitioned,  365 

population  correlation  matrix,  61 
principal  components  from,  383-384, 
393-397 

sample  correlation  matrix,  60-61 
from  covariance  matrix,  61 
from  data,  60 

Correspondence  analysis,  514-530 
contingency  table: 

higher-way  table,  526,  528-529 
two-way  table,  514-516,  519,  521 
coordinates  for  row  and  column  points, 
521-525 

distances  between  column  points,  523-524 
distances  between  row  points,  523-524 
singular  value  decomposition,  522 
generalized  singular  value 
decomposition,  522 
correspondence  matrix,  515-516 


definition  (graph  of  contingency  table), 
514-515 

independence of rows and columns, testing,
519-521

chi-square,  515,  520-521 
in  terms  of  frequencies,  520 
in  terms  of  inertia,  521 
in  terms  of  relative  frequencies,  520 
in terms of row and column profiles,
520-521
inertia,  515,  524 

multiple  correspondence  analysis,  526-530 
Burt  matrix,  526-529 
indicator  matrix,  526-527 
profiles  of  rows  and  columns,  515-519 
rows and columns, 514-525
association,  514 
inertia,  515,  524 
interaction,  514-515 
points  for  plotting,  521-525 
profiles,  515-519 

singular  value  decomposition,  522,  524 
generalized  singular  value  decomposition, 
522 

Covariance: 

and independence, 46-47
and orthogonality, 47-48
population covariance (σ_xy), 46-47
sample covariance (s_xy), 46-48
expected  value  of,  47 
and  linear  relationships,  47 
of  two  linear  combinations,  67-68,  72 
Covariance  matrix: 

compound  symmetry,  206 
and  correlation  matrix,  61 
of  linear  combinations  of  variables,  69-73 
partitioned,  62-66,  362 

dependence of y and x and cov(y, x), 63
difference between cov[(y', x')'] and cov(y, x), 63

three  or  more  subsets,  64-66 
pooled  covariance  matrix,  122-123 
population  covariance  matrix  (X),  58-59 
sample  covariance  matrix  (S),  57-60 
from  data  matrix,  58 
distribution  of,  91-92 
Wishart  distribution,  91-92 
from  observations,  57-58 
positive  definiteness  of,  67 
and  sample  mean  vector,  independence  of, 
92 

sphericity,  206,  250-252 
tests on, 248-268. See also Tests of
hypotheses, covariance matrices
unbiasedness  of,  59 
uniformity,  206,  252-254 

Cross  validation,  310-311 

Cyclical  data,  153 

Data  matrix  (Y),  55 

Data  sets: 

air  pollution  data,  502 
airline  distance  data,  508 
athletic  record  data,  480 
bar  steel  data,  192 
beetles  data,  150 
birth  and  death  data,  543 
blood  data,  237 
blood  pressure  data,  245 
bronchus  data,  154 
byssinosis  data,  545-546 
calcium  data,  56 
calculator  speed  data,  210 
chemical  data,  340 
city  crime  data,  456 
coated  pipe  data,  135 
cork  data,  239 
cyclical  data,  153 
dental  data,  227 
diabetes  data,  65 
dogs  data,  243-244 
do-it-yourself  data,  529 
dystrophy  data,  152 
engineer  data,  151 
fabric  wear  data,  238 
fish  data,  235 
football data, 280-281
glucose data, 80-81
guinea pig data, 201
height-weight data, 45
hematology  data,  109-110 
mandible  data,  247 
mice  data,  241 
Norway  crime  data,  544 
people  data,  526 
perception  data,  419 
plasma  data,  246 
piston  ring  data,  518 
politics  data,  542 
probe  word  data,  70 
protein  data,  483 
psychological  data,  125 
ramus  bone  data,  78 
repeated  data,  218 
Republican  vote  data,  53 
road  distance  data,  541 


rootstock  data,  171 
Seishu  data,  263 
snapbean  data,  236 
sons  data,  79 
steel  data,  273 
survival  data,  239-241 
temperature  data,  269 
trout  data,  242 
voting  data,  512 
weight  gain  data,  243 
wheat  data,  503 
words  data,  154 

Data, types of, 3-4. See also Multivariate data
Density  function,  43 
Dental  data,  227 
Descriptive  statistics,  2 
Determinant,  26-29 
definition  of,  26-27 
of  diagonal  matrix,  27 
of  inverse,  29 
of  nonsingular  matrix,  28 
of  partitioned  matrix,  29 
of  positive  definite  matrix,  28 
of  product,  28 

as  product  of  eigenvalues,  34 
of  scalar  multiple  of  a  matrix,  28 
of  singular  matrix,  28 
Diabetes  data,  65 
Diagonal  matrix,  8 

Discriminant  analysis  (descriptive),  270-296 
and  canonical  correlation,  282,  376-378 
and  classification  analysis,  270 
discriminant  functions: 

for  several  groups,  165,  184-185,  191, 
277-279 

measures  of  association  for,  282 
for  two  groups,  126-132,  271-275 
and  distance,  272 
and  eigenvalues,  278-279 
interpretation  of  discriminant  functions, 
288-291 

correlations  (structure  coefficients),  291 
partial  F-values,  290 
rotation,  291 

standardized  coefficients,  289 
purposes  of,  277 
scatter  plots,  291-293 
selection  of  variables,  233,  293-296 
several  groups,  277-279 
standardized  discriminant  functions,  282-284 
stepwise  discriminant  analysis,  233,  293-296 
tests  of  significance,  284-288 
two  groups,  271-275 

and  multiple  regression,  130-132,  275-276 
Discriminant analysis (predictive), see
Classification  analysis 
Dispersion  matrix,  see  Covariance  matrix 
Distance  between  vectors,  76-77,  83,  115,  118, 
123,  271-272 

Distribution: 
beta,  97 

bivariate  normal,  46,  84,  88-89 
chi-square,  86 
elliptically  symmetric,  103 
F,  119,  138,  158,  162-163,  179,  254-255 
multivariate  normal,  see  Multivariate  normal 
distribution;  Multivariate 
normality,  tests  for 
univariate  normal,  82-83,  86 

tests  for,  see  Univariate  normality,  tests  for 
Wishart,  91-92 
Dogs  data,  243-244 
Do-it-yourself  data,  529 
Dummy  variables,  173-174,  282,  315, 

376-377 

Dystrophy  data,  152 

E  matrix,  160-161,  339,  342-344 
Eigenvalues,  32-37,  168,  362-365,  382-384, 
397-398,  416-419,  422-423 
Eigenvectors,  32-35,  363-365,  382-384, 

397-398,  416-418,  420-422 
Elliptically  symmetric  distribution,  103 
EM  algorithm,  75,  491 
Engineer  data,  151 
Error  rate(s),  307-313 
actual  error  rate,  308 
apparent  error  rate,  307 
bias  in,  308,  309-311 
classification  table,  307-308 
cross  validation,  310-311 
experimentwise  error  rate,  1-2,  128-129, 
183-185 

holdout method, 310-311, 318
leaving-one-out method, 310-311, 318
partitioning  the  sample,  310 
resubstitution,  307-308 
Expected  value: 

of  random  matrix,  59 

of random vector [E(y)], 55-56
of sample covariance matrix [E(S)], 59
of sample mean [E(ȳ)], 44
of sample mean vector [E(ȳ)], 56
of sample variance [E(s²)], 44
of sum or product of random variables, 46
of univariate random variable [E(y)], 43
Experimental  units,  1 


F-test(s): 

ANOVA,  158,  188 

between-subjects tests in repeated measures,
212, 216, 221

comparing  two  variances,  254-255 
contrasts,  179 

equivalent to T², 119, 124, 137-138
in  multiple  regression,  138,  330-332 
partial  F-test,  127,  138,  232,  293-296 
stepwise  selection,  233,  293-296,  336 
test  for  additional  information,  137 
test  for  individual  variables  in  MANOVA, 
183-186 

Wilks’ Λ:

exact  F  transformation  for,  162-163 
F  approximation  for,  162-163 
Fabric  wear  data,  238 
Factor analysis, 408-450
assumptions,  410-412 
failure  of  assumptions,  consequences  of, 
414,  444-445 
common  factors,  409 

communalities, 413, 418, 422-423, 427-428
estimation of, 418, 422, 424, 428
eigenvalues, 416-419, 422-423, 427, 442,
446

eigenvectors, 416-418, 420, 422
factor scores, 438-443
averaging method, 440
regression method, 439-440
factors, 408-414
common, 409
definition of, 408-409
interpretation of, 409, 438
number of, 426-430
Heywood case, 424-425
loadings:

definition of, 409
estimation of, 415-426

comparison  of  methods,  424 
fit  of  the  model,  419 
iterated  principal  factor  method, 
424-425 

Heywood  case,  424-425 
maximum  likelihood  method,  425-426 
principal component method, 415-421
principal factor method, 421-424
from S or R, 418-419, 421-422
model, 409-414

modeling covariances or correlations, 408,
410, 412, 414, 417
number of factors to retain, 426-430
average eigenvalue, 427-428
comparison  of  methods,  428-430 
hypothesis test, 427-428
indeterminacy of, for certain data sets,
428-429
scree plot, 427-428
variance accounted for, 427-428
orthogonal factors, 409-415, 431-435
and principal components, 408-409, 447-448
and regression, 410, 439-440
rotation, 414-415, 417, 430-437
complexity of the variables, 431
interpretation of factors, 409, 438
oblique rotation, 431, 435-437
and orthogonality, 437
pattern matrix, 436
orthogonal rotation, 431-435
analytical, 434
communalities, 415, 431
graphical, 431-433
varimax, 434-435
simple structure, 431
scree plot, 427-428
simple structure, 431
singular matrix and, 422
specific variance, 410, 417
specificity, see specific variance
total variance, 418-419, 427
validity of factor analysis model, 443-447
how well model fits the data, 419, 444
measure of sampling adequacy, 445
variance due to a factor, 418-419
Fish  data,  235 

Fisher’s  classification  function,  300-302 
Football data, 280-281

Gauss-Markov  theorem,  341 
Generalized  population  variance,  83-85,  105 
Generalized  sample  variance,  73 
total  sample  variance,  73,  383,  409,  418, 

427 

Generalized  singular  value  decomposition,  522 
Geometric  mean,  174 
Glucose  data,  80 

Graphical  display  of  multivariate  data,  52-53 
Graphical  procedures,  504-547 
biplots,  see  Biplots 

correspondence  analysis,  see  Correspondence 
analysis 

multidimensional  scaling,  see 

Multidimensional  scaling 
Growth  curves,  221-230 
contrast  matrices,  222-225,  227-230 
one  sample,  221-229 
orthogonal  polynomials,  222-225 
polynomial  function  of  t,  225-227 


several  samples,  229-230 
unequally  spaced  time  points,  225-227 
Guinea  pig  data,  201 

H  matrix,  160-161,  343-344 
Height-weight  data,  45 
Hematology  data,  109-110 
Hierarchical  clustering,  see  Cluster  analysis, 
hierarchical  clustering 
Hotelling-Lawley test statistic, see
Lawley-Hotelling test statistic
Hotelling’s T²-statistic, see T²-statistic
Hyperellipsoid,  73 

Hypothesis  tests,  see  Tests  of  hypotheses 

Identity  matrix,  8 
Imputation,  74 

Independence  of  variables,  test  for, 

265-266 

table  of  exact  critical  values,  590 
Indicator  variables,  see  Dummy  variables 
Inferential  statistics,  2 
Intra-class  correlation,  198-199 

Kernel  density  estimators,  315-317 
Kurtosis,  94-95,  98-99,  103-104 

Largest  root  test,  see  Roy’s  test  statistic 
Latent  roots,  see  Eigenvalues 
Lawley-Hotelling  test  statistic: 
definition  of,  167 
table  of  critical  values,  582-586 
Length  of  vector,  14 
Likelihood  function,  90 
Likelihood  ratio  test(s): 

for  covariance  matrices,  248-250,  253,  256, 
260,  262,  265 
in  factor  analysis,  428 
for  mean  vectors,  126,  164 
Linear  classification  functions,  301-306 
Linear  combination  of  matrices,  19 
Linear  combination(s)  of  variables,  2,  67-73, 

113 

correlation  matrix  for  several  linear 
combinations,  72 

correlation  of  two  linear  combinations,  67, 
71-73 

covariance  matrix  for  several  linear 

combinations,  69-70,  72-73 
covariance  of  two  linear  combinations,  67-68, 
71-72 
distribution  of,  86 

mean  of  a  single  linear  combination,  67, 
71-72 
Linear combination(s) of variables (cont.)
mean  vector  for  several  linear  combinations, 
69 

variance  of  a  single  linear  combination,  67, 
71-72 

Linear  combination  of  vectors,  19 
Linear  hypotheses,  141-142,  199-201, 

208-225 

Mahalanobis  distance,  76-77,  83 
Mandible  data,  247 

MANOVA,  130,  158.  See  also  Analysis  of 
variance,  multivariate 
Matrix  (matrices): 
algebra  of,  5-37 
bilinear  form,  19-20 
Burt  matrix,  526-529 
Cholesky  decomposition,  25-26 
conformable,  11 
covariance  matrix,  57-59 
definition,  5-6 

determinant,  26-29,  34.  See  also  Determinant 
of  diagonal  matrix,  27 
of  inverse  matrix,  29 
of  partitioned  matrix,  29 
of  positive  definite  matrix,  28 
of  product,  28 

of  scalar  multiple  of  a  matrix,  28 
of  singular  matrix,  28 
of  transpose,  29 
diagonal,  8 

eigenvalues,  32-37.  See  also  Eigenvalues 
characteristic  equation,  32 
and  determinant,  34 
of I + A, 33
of  inverse  matrix,  36 
of  positive  definite  matrix,  34 
Perron-Frobenius  theorem,  34 
square  root  matrix,  36 
of  product,  35 

singular  value  decomposition,  36 
of  square  matrix,  36 
of  symmetric  matrix,  35 
spectral  decomposition,  35 
and  trace,  34 

eigenvectors,  32-37.  See  also  Eigenvectors 
equality  of,  7 
identity,  8 

indicator matrix, 526-527
inverse,  23-25 

of  partitioned  matrix,  25 
of  product,  24 
of  transpose,  24 
j  vector,  9 


J  matrix,  9 

linear  combination  of,  19 
nonsingular  matrix,  23 
notation  for  matrix  and  vector,  5-6 
O  (zero  matrix),  9 
operations  with,  9-20 
distributive  law,  12 
factoring,  12-13,  15 
product, 11-20, 23-25
conformable,  11 
with  diagonal  matrix,  18 
distributive  over  addition,  12 
and  eigenvalues,  34-35 
of  matrix  and  scalar,  19 
of  matrix  and  transpose,  16-18 
of  matrix  and  vector,  12-13,  16,  21 
as  linear  combination,  21 
noncommutativity of, 11
product  equal  zero,  23 
transpose  of,  12 
triple  product,  13 
of  vectors,  14 
sum,  10 

commutativity  of,  10 
orthogonal,  31 
rotation of axes, 31-32
partitioned  matrices,  20-22 
determinant  of,  29 
inverse  of,  25 
product  of,  20-21 
transpose  of,  22 

Perron-Frobenius  theorem,  34,  402 
positive  definite,  25,  34 
positive  semidefinite,  25,  34 
quadratic  form,  19 
rank,  22-23 
full  rank,  22 
scalar,  6 

product  of  scalar  and  matrix,  19 
singular  matrix,  24 
singular  value  decomposition,  36 
size  of  a  matrix,  6 
spectral  decomposition,  35 
square  root  matrix,  36 
sum  of  products  in  vector  notation,  14 
sum  of  squares  in  vector  notation,  14 
symmetric,  7,  35 
trace,  30,  34,  69 
and  eigenvalues,  34 
of  product,  30 
of  sum,  30 
transpose,  6-7 
of  product,  12 
of  sum,  10 
triangular, 8
vectors, see Vector(s)
zero matrix (O) and zero vector (0), 9
Maximum likelihood estimation, 90-91
of correlation matrix, 91
of  covariance  matrix,  90 
likelihood  function,  90 
of  mean  vector,  90-91 
multivariate  normal,  90 
Mean: 

geometric,  174 
of  linear  function,  67,  72 
population mean (μ), 43
of product, 46
sample mean (ȳ), 43-44
of  sum,  46 

Mean  vector,  54-56,  83,  90-92 
notation,  54 

population mean vector (μ), 55-56
sample mean vector (ȳ), 54-56
from data matrix, 55
distribution of, 91
and  sample  covariance  matrix, 
independence  of,  92 
Measurement  scale,  2 
interval  scale,  2 
ordinal  scale,  2 
ratio  scale,  2 
Mice  data,  241 

Misclassification rates, see Error rate(s)
Missing  values,  74-76 
Multicollinearity,  74,  84 
Multidimensional  scaling,  504-514 
classical  solution,  see  metric 

multidimensional  scaling 
definition,  504-505 
distances,  504-505 
seriation  (ranking),  504 
metric  multidimensional  scaling,  504-508 
algorithm for finding the points,
505-508 

and  principal  component  analysis,  506 
nonmetric  multidimensional  scaling,  505, 
508-514 

monotonic  regression,  509-510 
ranked  dissimilarities,  508-509 
STRESS,  510-512 

principal  coordinate  analysis,  see  metric 
multidimensional  scaling 
spectral  decomposition,  505-506 
Multiple correlation, 332, 361-362, 423. See
also R²

Multiple  correspondence  analysis,  526-530 
Burt  matrix,  526-529 


column coordinates, 527, 529-530
indicator  matrix,  526-527 
Multiple  regression,  see  Regression,  multiple 
Multivariate  analysis,  1 
descriptive  statistics,  1-2 
inferential  statistics,  2 

Multivariate  analysis  of  variance  (MANOVA), 
see  Analysis  of  variance, 
multivariate 
Multivariate  data: 
basic  types  of,  4 
plotting  of,  52-53 
sparseness of, 97
Multivariate  inference,  2 
Multivariate  normal  distribution,  82-105 
applicability  of,  85 
conditional  distribution,  88 
contour  plots,  84-85 
density  function,  83 
distribution of ȳ and S, 91-92
features of, 82
independence of ȳ and S, 92
linear  combinations  of,  86 
marginal  distribution,  87 
maximum  likelihood  estimates,  90-91.  See 
also  Maximum  likelihood 
estimation 
properties  of,  85-90 

quadratic  form  and  chi-square  distribution,  86 
standardized  variables,  86 
zero  covariance  matrix  implies  independence 
of subvectors, 87

Multivariate  normality,  tests  for,  92,  96-99 
Dᵢ², 97-98, 102-103
and  chi-square,  98 
table  of  critical  values,  557 
dynamic  plot,  98 
scatter  plots,  98,  105 

skewness  and  kurtosis,  multivariate,  98-99, 
103-104,  106 

table  of  critical  values,  553-556 
Multivariate  regression,  see  Regression, 
multivariate 

Nonsingular  matrix,  23 
Normal  distribution: 

bivariate  normal,  46,  84,  88-89,  133 
multivariate  normal,  see  Multivariate  normal 
distribution 

univariate  normal,  82-83,  86 
Normality,  tests  for,  see  Multivariate  normality; 

Univariate  normality 
Norway  crime  data,  544 
Numerical  taxonomy,  see  Cluster  analysis 
Objectives of this book, 3
Observations,  1 

One-sample test for a mean vector, 117-121
Orthogonal  matrix,  31 
Orthogonal  polynomials,  222-225 
table  of,  587 
Orthogonal  vectors,  50 
Outliers: 
multivariate: 
kurtosis,  103-104 

elliptically  symmetric  distributions,  103 
principal  components,  389-392 
slippage  in  mean,  variance,  and  correlation, 
101 

Wilks’  statistic,  102-103 
univariate: 

accommodation,  100 
block  test,  101 
identification,  100 
masking,  101 

maximum  studentized  residual,  100-101 
skewness  and  kurtosis,  101 
slippage  in  mean  and  variance,  100 
swamping,  101 
Overall  variability,  73-74 

Paired  observation  test,  132-136 
Partial  F-tests,  127,  138,  232,  293-296 
Partitioned  matrices,  see  Matrix  (matrices), 
partitioned  matrices 

Partitioning,  see  Cluster  analysis,  partitioning 
Pattern  recognition,  see  Cluster  analysis 
People  data,  526 
Perception  data,  419 
Perron-Frobenius  theorem,  34,  402 
Pillai’s  test  statistic: 
definition  of,  166 
table  of  critical  values,  578-581 
Piston  ring  data,  518 
Plasma  data,  246 
Plotting  multivariate  data,  52-53 
Politics  data,  542 
Positive  definite  matrix,  25 

positive  definite  sample  covariance  matrix,  67 
Prerequisites  for  this  book,  3 
Principal components, 380-407
algebra  of,  385-387 
and  biplots,  531-532 

and cluster analysis, 390-393, 395, 482-484,
487 

component  scores,  386 
definition  of,  380,  382,  385 
dimension  reduction,  381-384,  385-387,  389 
eigenvalues  and  eigenvectors,  382-385, 
397-398 


major  axis,  384,  388 

and  factor  analysis,  403,  408-409,  447^448 
geometry  of,  381-385 
interpretation,  401-404 
correlations,  403-404 
rotation,  403 

special  patterns  in  S  or  R,  401-403 
size  and  shape,  402-403 
large  variance  of  a  variable,  effect  of, 
383-384,  402 

last  few  principal  components,  382,  389,  401 
maximum  variance,  380,  385 
minimum  perpendicular  distances  to  line, 
387-388 

number  of  components  to  retain,  397-401 
orthogonality  of,  380,  383-384 
percent  of  variance,  383,  397 
and  perpendicular  regression,  385,  387-389 
plotting  of,  389-393 
assessing  normality,  389-390 
detection  of  outliers,  389-391 
properties  of,  381-386 
proportion  of  variance,  383 
robust,  389 

as  rotation  of  axes,  381-382,  384-385 
from  S  or  R,  383-384,  393-397 

nonuniqueness  of  components  from  R,  397 
sample  specific  components,  398 
scale  invariance,  lack  of,  383 
scree  graph,  397-399 
selection  of  variables,  404-406 
singular  matrix  and,  385-386 
size  and  shape,  402-403 
smaller  principal  components,  382,  389,  401 
tests  of  significance  for,  397,  399-400 
variable  specific  components,  398 
variances  of,  382-383 
Probe  word  data,  70 
Product  notation  (∏),  10 
Profile,  139-140 

profile  of  observation  vector,  454 
Profile  analysis: 

and  contrasts,  141-142 
one-sample,  139-141 
and  one-way  ANOVA,  140 
profile,  definition  of,  139-140 
and  repeated  measures,  139 
several-sample,  199-204 
two-sample,  141-148 
hypotheses: 

flatness,  145-146,  199-201 
levels,  143-145,  199-200 
parallelism,  141-143,  199-200 
and  two-way  ANOVA,  143-145 
Projection  pursuit,  451 
Protein  data,  483 
Psychological  data,  125 

Q-Q  plot,  92-94 

Quadratic  classification  functions,  306-307 
Quadratic  form,  19 
Quantiles,  92-94,  97 

R²  (squared  multiple  correlation),  332-333, 
337,  349,  355,  361-362,  365, 
375-376,  422-423 
Ramus  bone  data,  78 
Random  variable(s): 
bivariate,  45 

bivariate  normal  distribution,  46,  84, 
88-89 

correlation  of,  49-50 
as  cosine,  49-50 
covariance  of,  46-48 
linear  relationships,  47 
independent,  46 

test  for  independence,  265-266 
table  of  exact  critical  values,  590 
orthogonal,  47-48 
scatter  plot,  50-51 
linear  combinations,  see  Linear 

combination(s)  of  variables 
univariate,  43 

expected  value  of,  43 
mean  of,  43 
variance  of,  44 
vector,  53-56 
Random  vector(s),  52-56 
distance  between,  76-77,  83,  115,  118,  123, 
271-272 

linear  functions  of,  66-73.  See  also  Linear 
combination(s)  of  variables 
mean  of,  54-56 

partitioned  random  vector,  62-66 
standardized,  86 
subvectors,  62-66 
Rank  of  a  matrix,  22-23 
Rao’s  paradox,  116 
Redundancy  analysis,  373-374 
Regression,  monotonic,  509-510 
Regression,  multiple  (one  y  and  several  x’s), 
130-132,  323-337.  See  also 
Regression,  multivariate 
centered  x’s,  327-329 
estimation  of  β: 

centered  x’s,  327-328 
covariances,  328-329 
least  squares,  325-326 
estimation  of  σ²,  326-327 
fixed  x’s,  323-333 


model,  323-324 
assumptions,  323-324 
corrected  for  means  (centered),  327 
multiple  correlation,  332 
R²  (squared  multiple  correlation),  332-333, 
337,  349,  355.  See  also  R² 
random  x’s,  322-323,  337 
regression  coefficients,  323 
SSE,  325-326,  330-331,  333-336,  456 
SSR,  330-331 
subset  selection,  333-337 
all  possible  subsets,  333-335 

criteria  for  selection  (R²,  s²,  Cp), 
333-335 

comparison  of  criteria,  335 
stepwise  selection,  335-337 
tests  of  hypotheses,  329-332 
full  and  reduced  model,  330-332 
partial  F-test,  331-332 
overall  regression  test,  329-330 
subset  of  the  β’s,  330-332 
variables: 

dependent  (y),  322 
independent  (x),  322 
predictor  (x),  322 
response  (y),  322 

Regression,  multivariate  (several  y’s  and  several 
x’s),  322-323,  337-358 
association,  measures  of,  349-351 
centered  x’s,  342-343 
estimation  of  B  (matrix  of  regression 
coefficients): 
centered  x’s,  342-343 
covariances,  343 
least  squares,  339-341 
properties  of  estimators,  341-342 
estimation  of  Σ,  342 
fixed  x’s,  337-349 
Gauss-Markov  theorem,  341 
model,  337-339 
assumptions,  339 

corrected  for  means  (centered),  342-343 
random  x’s,  358 

regression  coefficients,  matrix  of  (B),  88, 

338 

subset  selection,  351-358 
all  possible  subsets,  355-358 

criteria  for  selection  (R²p,  Sp,  Cp), 
355-358 

stepwise  procedures,  351-355 
partial  Wilks’  Λ,  352-354 
subset  of  the  x’s,  351-353 
subset  of  the  y’s,  353-355 
tests  of  hypotheses,  343-349 
E  matrix,  339,  342-344 
full  and  reduced  model: 
on  the  x’s,  347-349 
on  the  y’s,  353-355 
with  canonical  correlations,  375-376 
H  matrix,  343-344 
overall  regression  test,  343-347 
with  canonical  correlations,  375 
comparison  of  test  statistics,  345 
Lawley-Hotelling  test,  345 
Pillai’s  test,  345 
rank  of  B,  345 

Roy’s  test  (union-intersection),  344-345 
Wilks’  Λ  test  (likelihood  ratio),  344 
subset  of  the  x’s,  351-353 
with  canonical  correlations,  375-376 
subset  of  the  y’s,  353-355 
Repeated  data  set,  218 

Repeated  measures  designs,  204-221.  See  also 
Growth  curves 
assumptions,  204-207 
computation  of  test  statistics,  212-213 
contrast  matrices,  206,  208-221 
doubly  multivariate  data,  221 
higher  order  designs,  213-221 
multivariate  approach,  advantages  of, 
205-207 
one  sample,  208-211 

likelihood  ratio  test,  209-210 
and  randomized  block  designs,  208 
and  profile  analysis,  139 
several  samples,  211-212 
univariate  approach,  204-207 
Republican  vote  data,  53 
Research  units,  1 
Road  distance  data,  541 
Rootstock  data,  171 
Rotation,  see  Factor  analysis 
Roy’s  test  statistic: 
definition  of,  164-165 
table  of  critical  values,  574-577 

Sampling  units,  1 
Scalar,  6 

Scale  of  measurement,  1 
Scatter  plot,  50-51,  98,  105 
Seishu  data,  263 

Selection  of  variables,  233,  333-337,  351-358 
Singular  value  decomposition,  36,  522,  524, 
532-533 

generalized  singular  value  decomposition, 
522 

Size  and  shape,  402-403 
Skewness,  94-95,  98-99,  104 


Snapbean  data,  236 
Sons  data,  79 

Specific  variance,  see  Factor  analysis 
Spectral  decomposition,  35,  382,  416-418, 
505-506 

Squared  multiple  correlation,  see  R² 
Standard  deviation,  44 
Standardized  vector,  86 
Steel  data,  273 

Stepwise  selection  of  variables,  233,  335-337, 
351-355 

STRESS,  510-512 
Subvectors,  62-66 

conditional  distribution  of,  88 
covariance  matrix  of,  62-66 
distribution  of  sum  of,  88 
independence  of,  63,  87 
mean  vector,  62-64,  66 
tests  of,  136-139,  231-233,  347-349, 
353-359 

Summation  notation  (∑),  9 
Survival  data,  239-241 

t-tests: 

characteristic  form,  117,  122 
contrasts,  179 

equal  levels  in  profile  analysis,  145 
growth  curves,  224,  228 
matched  pairs,  132-133 
one  sample,  117 
paired  observations,  132-133 
repeated  measures,  210-211 
two  samples,  121-122,  127 
T²-statistic: 

additional  information,  test  for,  136-139 
assumptions  for,  122 
characteristic  form,  118,  123 
chi-square  approximation  for,  120 
computation  of,  130-132 
by  MANOVA,  130 
by  regression,  130-132 
and  F-distribution,  119,  124,  137-138 
full  and  reduced  model  test,  137 
likelihood  ratio  test,  126 
matched  pairs,  134-136 
one-sample,  117-121 
paired  observations,  134-136 
and  profile  analysis,  139-148 
one  sample,  139-141 
two  samples,  141-148 
properties  of,  119-120,  123-124 
for  a  subvector,  136-139 
table  of  critical  values  for  T2,  558-561 
two-sample,  122-126 
Taxonomy,  numerical,  see  Cluster  analysis 
Temperature  data,  269 
Tests  of  hypotheses: 
accepting  H0,  118 

for  additional  information,  136-139, 

231-233,  347-349,  353-359 
partial  F-tests,  127,  138,  232 
covariance  matrices,  248-268 
one  covariance  matrix,  248-254 
independence: 

individual  variables,  265-266 
table  of  exact  critical  values,  590 
several  subvectors,  261-264 
two  subvectors,  259-261 

and  canonical  correlations,  260 
a  specified  matrix  Σ0,  248-249 
sphericity,  250-252 
uniformity,  compound  symmetry,  206, 
252-254 

several  covariance  matrices,  254-259 
Box’s  M-test,  257-259 

table  of  exact  critical  values, 

588-589 

on  individual  variables,  126-130 
Bonferroni  critical  values  for,  127 
tables,  562-565 

discriminant  functions,  126-132 
experimentwise  error  rate,  128-129 
partial  F-tests,  127,  232 
protected  tests,  128-129 
likelihood  ratio  test,  126.  See  also  Likelihood 
ratio  tests 

for  linear  combinations: 

one  sample  (H0:  Cμ  =  0),  117,  140-141, 
208-211 

two  samples  (H0:  Cμ1  =  Cμ2),  142-143 
mean  vectors: 

likelihood  ratio  tests,  126 
one  sample,  Σ  known,  114-117 
one  sample,  Σ  unknown,  117-121 
several  samples,  158-173 
two-sample  T²-test,  122-126 
multivariate  vs.  univariate  testing,  1-2, 

112-113,  115-117,  127-130 
paired  observations  (matched  pairs), 

132-136 

multivariate,  134-136 
univariate,  132-133 
partial  F-tests,  127,  138,  232 
power  of  a  test,  113, 
protected  tests,  128-129 
on  regression  coefficients,  329-332, 

343-349 


on  a  subvector,  136-139,  231-233,  347-349, 
353-359 
univariate  tests: 

ANOVA  F-test,  156-158,  186-188 
one-sample  test  on  a  mean,  σ  known, 

113 

one-sample  test  on  a  mean,  σ  unknown, 
117 

paired  observation  test,  132-133 
tests  on  variances,  254-255 
two-sample  t-test,  121-122,  127 
variances,  equality  of,  254-255 
Total  sample  variance,  74,  383,  409,  418-419, 
427 

Trace  of  a  matrix,  30,  34,  69 
Trout  data,  242 

Two-sample  test  for  equal  mean  vectors, 
122-126 

Union-intersection  test,  164-165 
Unit: 

experimental,  1 
research,  1 
sampling,  1 

Univariate  normal  distribution,  82-83,  86 
Univariate  normality,  tests  for,  92-96 
D’Agostino’s  D-statistic,  96 
table  of  critical  values,  552 
goodness-of-fit  test,  96-97 
normal  probability  paper,  94 
Q-Q  plot,  92-94 
quantiles,  92-94,  97 
skewness  and  kurtosis,  94-95 
tables  of  critical  values,  549-551 
transformation  of  correlation,  96 

Variables,  1.  See  also  Random  variable(s) 
commensurate,  1 

dummy  variables,  173-174,  282,  315, 
376-377 

linear  combinations  of,  66-73 
Variance: 

generalized  sample  variance,  73 
pooled  variance,  121 
population  variance  (σ²),  44 
sample  variance  (s²),  44 
total  sample  variance,  74 
Variance-covariance  matrix,  see  Covariance 
matrix 

Variance  matrix,  see  Covariance  matrix 
Varimax  rotation,  434-435 
Vector(s): 

0  vector,  9 
definition  of,  6 
distance: 

Mahalanobis,  76-77 
from  origin  to  a  point,  14 
between  two  vectors,  76-77 
geometry  of,  6 
j  vector,  9 
length  of,  14 
linear  combination  of,  19 
linear  independence  and  dependence  of, 
22 

normalized,  31 
notation  for  vector,  6 
observation  vector,  53-54 
orthogonal,  31,  50 
perpendicular,  50 
product  of,  14-16 
dot  product,  14 


rows  and  columns  of  a  matrix,  15-16 
standardized,  86 
subvectors,  62-66 
sum  of  products,  14 
sum  of  squares,  14 
transpose  of,  6-7 
zero  vector,  9 
Voting  data,  512 

Weight  gain  data,  243 
Wheat  data,  503 
Wilks’  Λ  test  statistic: 
definition  of,  161-164 
partial  Λ-statistic,  232 
table  of  critical  values,  566-573 
Wishart  distribution,  91-92 
Words  data,  154 